Thanks for the hotfix, Aaron, and for the reply to my questions, Faidon. The multimedia team has allocated time this sprint to come up with a short-term and a long-term plan for the general Image Scaler situation, and we'll allocate time in each subsequent sprint to act on that plan and implement it.
I'd like to have an engineering meeting with both of you and our team (invite sent), to help us fully understand the various parts involved. At the moment I think we all (multimedia folks) have a lot of catching up to do in terms of knowledge of that code; I don't think any of us would have been capable of writing the hotfix Aaron did. Our team needs to acquire that knowledge ASAP.
Now, let's not wait for the meeting to start discussing the issue and proposed solutions. As I've stated earlier in this thread, I would like to revisit the fact that the thumbnails are being generated in real time, because it's the bigger problem and Image Scalers will still be subject to going down if we don't fix that. I suspect that a solution to that broader problem would solve the GWToolset issue as well.
Aaron also worked on wrapping thumb calls in PoolCounter on the MediaWiki side lately. Am I correct in assuming that the PoolCounter wrapping is used to deduplicate requests from different image scaler servers to generate the same thumbnail?
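So we don't start the meeting from zero: here's how I currently picture that deduplication working, as a rough Python/Redis sketch rather than anything resembling the actual PoolCounter code. The key name and the two placeholder helpers are invented:

import time
import redis

r = redis.StrictRedis()

def render_thumbnail(file_name, width):
    """Placeholder for the actual scaling work."""

def fetch_existing_thumb(file_name, width):
    """Placeholder: fetch an already-stored rendering from the thumb store."""

def get_thumb_deduped(file_name, width, lock_ttl=300, poll=0.2):
    """Only one worker renders a given (file, width); everyone else waits for its result."""
    key = "thumb-lock:%s:%d" % (file_name, width)
    # SET NX EX: take the lock only if nobody holds it, with an expiry so a
    # crashed worker can't hold it forever.
    if r.set(key, "1", nx=True, ex=lock_ttl):
        try:
            return render_thumbnail(file_name, width)
        finally:
            r.delete(key)
    # Another request is already rendering this exact thumb: wait for it
    # instead of duplicating the work.
    while r.exists(key):
        time.sleep(poll)
    return fetch_existing_thumb(file_name, width)

If that's roughly what the PoolCounter wrapping achieves, it's complementary to the throttling idea below rather than a replacement for it.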
I've thought about the problem since the incident and came up with the following rough idea, which I'd like everyone to criticize/validate against their knowledge of the existing setup. Note that it's based on a surface understanding of things; I still don't know the existing code very well, which is why your knowledge applied against my idea will be very useful.
Instead of each image scaler server generating a thumbnail immediately when a new size is requested, the following would happen in the script handling the thumbnail generation request (a rough code sketch follows the list):

- The client's connection would be kept open, potentially for a while, all the way from the end-user to the thumb script.
- A data store (what kind is TBD, whatever is most appropriate; Redis might be a good candidate) would be connected to and a virtual resource request would be added to a queue. Basically the script would be queueing an "I need one unit of thumbnail generation" request object.
- In a loop, the same data store would be read every X milliseconds to check the request object's position in the queue.
- If the request object's position is below a certain configurable limit, the actual job starts, scaling the image and storing the thumb.
- When the script is done generating the thumbnail, it removes its request object from the queue, essentially liberating a virtual unit of thumbnail generation work for another thumb request to use. And of course it returns the generated thumb as it currently does.
- If the script dies for whatever reason, an expiry setting on the queued request object would kick in at the data store level, liberating the virtual unit automatically.
- If the client requesting the thumb gives up and closes the connection because there's high load and they don't see the thumb appear fast enough for their taste, the script would keep running, wait for its turn in the queue and ultimately render the thumb anyway. This is crucial: at times of high load when things start taking time, we definitely don't want users refreshing the page to abort and restart the exact same thumb generation work. Every thumb generation request should complete, even if much later and with no end-user left holding an open connection to see the result.
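To make the loop above concrete, here's a rough sketch of what the per-request logic could look like, written in Python against Redis purely for illustration. The key names, the limit, the TTL and the render_thumbnail() helper are all placeholders; the real thing would live in the thumb handler, whatever language that ends up being in:

import time
import uuid
import redis

r = redis.StrictRedis()

QUEUE_KEY = "thumb-queue"   # sorted set holding the queued request objects
MAX_ACTIVE = 8              # configurable limit: virtual units of thumbnail generation
HEARTBEAT_TTL = 30          # seconds before a dead script's queue entry expires
POLL_INTERVAL = 0.1         # "read the data store every X milliseconds"

def render_thumbnail(file_name, width):
    """Placeholder for the actual scaling + storing of the thumb."""

def handle_thumb_request(file_name, width):
    request_id = str(uuid.uuid4())
    heartbeat_key = "thumb-alive:%s" % request_id

    # Queue an "I need one unit of thumbnail generation" object, plus a
    # separate key whose TTL acts as the expiry on that object.
    r.zadd(QUEUE_KEY, {request_id: time.time()})
    r.set(heartbeat_key, "1", ex=HEARTBEAT_TTL)
    try:
        while True:
            # Reclaim units from scripts that died: any entry near the front
            # of the queue whose heartbeat key has expired gets dropped.
            for member in r.zrange(QUEUE_KEY, 0, MAX_ACTIVE - 1):
                if not r.exists("thumb-alive:%s" % member.decode()):
                    r.zrem(QUEUE_KEY, member)

            # Our position in the queue; below the limit means we may start.
            rank = r.zrank(QUEUE_KEY, request_id)
            if rank is not None and rank < MAX_ACTIVE:
                return render_thumbnail(file_name, width)

            r.expire(heartbeat_key, HEARTBEAT_TTL)  # still alive, refresh the expiry
            time.sleep(POLL_INTERVAL)
    finally:
        # Done (or failed in a catchable way): remove our request object,
        # liberating the virtual unit for another thumb request.
        r.zrem(QUEUE_KEY, request_id)
        r.delete(heartbeat_key)

In the real handler this would presumably go hand in hand with ignoring the client abort (e.g. PHP's ignore_user_abort() or an equivalent), so the render always completes even when nobody is left waiting on the connection.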
Pros:

- The limit on virtual units of work means that image scaler server load should never go over a certain point. No more going down depending on the mix or quantity of thumb requests coming in; the servers would just be handling as much work as they actually can and no more.
- This would be agnostic of how long a given thumbnail generation takes, which means that someone uploading lots of large files that are very time-consuming to generate thumbs for, as in last weekend's incident, wouldn't take down image scalers. They'd just slow down thumb generation across the board.
- The queue size could be configurable. The best strategy is to start low (e.g. as many units as there are servers).
- The queue could be smarter than a plain queue and have weight- and client-based priority strategies (see the sketch after this list). For example, we could make it so that people uploading a lot of large images don't hog the queue to themselves. I don't see that as a requirement to solve the reliability issues, but it would be nice to have, and such prioritization would be very relevant to the GWToolset usage pattern. It would basically mean that someone uploading a lot of large images would only have their own thumbs take longer to generate than usual, not everyone else's as well.
- We could write "attack bots" that attempt to DoS the image scalers in various ways and let us verify that the new system behaves well under heavy load. We know the weaknesses; we should be testing them and making sure they're solved, instead of waiting for someone with bad intentions or an accident coming from an unusual usage pattern to do that for us.
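On the client-based prioritization bullet above: with the same sorted-set queue, one cheap way to get it would be to fold a per-client penalty into the score at enqueue time, so a heavy uploader's requests sort behind everyone else's without being starved entirely. A hypothetical sketch (the penalty value is arbitrary; the connection and queue key are the same as in the previous sketch):

import time
import redis

r = redis.StrictRedis()
QUEUE_KEY = "thumb-queue"
PENALTY_PER_PENDING = 60  # seconds of "virtual lateness" per request the client already has queued

def enqueue_with_fairness(request_id, client_id):
    # Each request a client already has in flight pushes their next one
    # further back, so flooding the queue only slows down their own thumbs.
    pending = r.incr("thumb-pending:%s" % client_id)
    score = time.time() + PENALTY_PER_PENDING * (pending - 1)
    r.zadd(QUEUE_KEY, {request_id: score})

def dequeue(request_id, client_id):
    # Called when the thumb is done (or the entry expires), freeing the unit
    # and lowering the client's penalty for their next request.
    r.zrem(QUEUE_KEY, request_id)
    r.decr("thumb-pending:%s" % client_id)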
Cons:

- Under high load, HTTP connections could potentially be held open for a while. Maybe that's not such a big deal; I guess Ops has the answer. I think it's fair to set a time limit on that too, and still have the thumbnail generation happen even after we've closed the connection to the client. From the end-user's perspective, they would wait a while for a thumb to appear, the connection would die, but chances are that if they refresh the page a few seconds or minutes later, the thumb will be there.
- During high load/DoS attacks/whatever, people viewing thumbs at new sizes will experience long image load times. Considering that most of the time the thumb will end up appearing before a timeout occurs, I think that's preferable to the status quo, which is that image scalers go down entirely in those situations.
On Tue, Apr 22, 2014 at 1:31 PM, Faidon Liambotis <faidon@wikimedia.org> wrote:
Fabrice,
I don't see how a feature release can be of a higher priority than troubleshooting an outage but regardless:
The outage's symptoms seem to have been alleviated since, but the Commons/GLAM communities are waiting for a response from us to resume their work. They've responded to our "pause" request and in turn requested our feedback at: https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps (see the large red banner with the stop sign at the bottom)
...which is also linked from: https://commons.wikimedia.org/wiki/User_talk:F%C3%A6#Large_file_uploads
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_b...
Sadly, I don't have much to offer them, as I previously explained. I certainly wouldn't commit to anything considering your response on the matter.
Could you communicate your team's priorities to Fæ and the rest of the Commons/GLAM community directly?
Thanks, Faidon
On Mon, Apr 21, 2014 at 08:57:39AM -0700, Fabrice Florin wrote:
Dear Faidon, Emmanuel and Giuseppe,
Thanks so much for investigating this issue so quickly and sharing the likely cause of the problem with us.
This quarter, our team's top priority is to address serious issues related to Upload Wizard -- and this seems like a good one for us to take on.
However, we are still in the process of releasing Media Viewer, which is likely to take most of our attention for the next few weeks.
So we may not be able to troubleshoot it right away. But we are filing tickets about this issue, so we can hit the ground running in early May.
Thanks again for your fine work, as well as for your patience and understanding.
Fabrice
On Apr 21, 2014, at 3:53 AM, Emmanuel Engelhart <emmanuel.engelhart@wikimedia.ch> wrote:
On 21.04.2014 12:05, Faidon Liambotis wrote:
On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
The problem resolved before I could get to strace the apache processes, so I don't have more details - Faidon was investigating as well and may have more info.
Indeed, I do: this had nothing to do with TMH. The trigger was Commons User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset over the course of 4-5 hours (multiple files per minute), and then random users/bots viewing Special:NewFiles, which attempts to display a thumbnail for all of those new files in parallel in realtime, and thus saturating imagescalers' MaxClients setting and basically inadvertently DoSing them.
The issue was temporary because of https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user kept uploading new files, it was recurrent, with different files every time. Essentially, we would keep having short outages every now and then for as long as the upload activity continued.
I left a comment over at
https://commons.wikimedia.org/wiki/User_talk:F%C3%A6
and contacted Commons admins over at #wikimedia-commons, as a courtesy to both before I used my root to elevate my privileges and ban a long-time prominent Wikimedia user as an emergency countermeasure :)
It was effective, as Fæ immediately responded and ceased the activity until further discussion; the Commons community was also helpful in the short discussion that followed.
Andre also pointed out that Fæ had previously begun the "Images so big they break Commons" thread at the Commons Village Pump:
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_b...
As for the more permanent solution: there's not much we, as ops, can do about this but say "no, don't upload all these files", which is obviously not a great solution :) The root cause is an architecture issue with how imagescalers behave with regards to resource-intensive jobs coming in a short period of time. Perhaps a combination of poolcounter per file and more capacity (servers) would alleviate the effect, but ideally we should be able to have some grouping & prioritization of imagescaling jobs so that large jobs can't completely saturate and DoS the cluster.
Commons has big difficulties dealing with big TIFF files, and this is a serious issue, in particular for Wikipedians in Residence. To me it looks like using the VipsScaler would help fix the worst cases.
Here is an email I sent to Andre and Greg a few days ago. I'm making it public in the hope it might help.
===========
As a GLAM volunteer and WIR at the Swiss National Library, I encourage institutions to upload high-quality pictures to increase digital sustainability. But in the worst case (big TIFF files), Commons is not able to deal with them and fails to compute the thumbnails.
You have a perfect example of the problem with this recently uploaded collection of historical plans of the Zurich main station:
https://commons.wikimedia.org/wiki/Category:Historical_plans_of_Zurich_Main_...
It seems that this problem might be fixed by using the VipsScaler for TIFF pictures, and Greg has already worked on this and proposed a patch. But this patch has been waiting for review for 7 months:
https://bugzilla.wikimedia.org/show_bug.cgi?id=52045
IMO it would be great if you could do something to increase the priority and urgency of this ticket. The movement invests considerable resources in building successful collaborations with GLAMs, and many of them get held back by this "silly" bug.
Hope you can help us.
-- Volunteer Technology, GLAM, Trainings Zurich +41 797 670 398
Fabrice Florin Product Manager Wikimedia Foundation