On Sun, May 11, 2014 at 11:33 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
I think the short-term outcome was to throttle GWToolset until there is a better fix. There is a patch pending to do that:

I described the thinking behind the limits in this mail and the follow-ups:
tl;dr: it tries to limit GWToolset-uploaded thumbnails to 10% of the files shown on the Special:* pages at any one time (5 files with the default page size of 50), based on the total upload rate in the slowest hour of an average day. That works out to about one image per two minutes.

The core patch is merged now, so we could backport and merge the config patch and restart GWToolset uploads in a few days, if we think the throttling is enough to prevent further outages.
That is a big if though - it is not clear that throttling would be a good way to avoid overloading the scalers.

My understanding is that there were three ways in which the NYPL map uploads were causing problems:

1. The scalers did not have enough processing power to handle all the thumbnail requests that were coming in simultaneously. This was presumably because Special:NewFiles and Special:ListFiles were filled with the NYPL maps, and users looking at those pages sent dozens of thumbnailing requests in parallel.
2. Swift traffic was saturated by GWToolset-uploaded files, making the serving of everything else very slow. I assume this was because of the scalers fetching the original files? Or could this be directly caused by the uploading somehow?
3. GWToolset jobs piling up in the job queue (Faidon said he cleared out 7396 jobs).

== Scaler overload ==

For the first problem, we can make an educated guess at the level of throttling required: if we want to keep the number of simultaneous GWToolset-related scaling requests below X, then Special:NewFiles and Special:ListFiles should not have more than X/2 GWToolset files on them at any given time. Those pages show the last 50 files, so GWToolset should not upload more than X files in the time it takes normal users to upload 100 of them. I counted the number of uploads per hour on Commons on a weekday, and there were 240 uploads in the slowest hour, which is about 25 minutes per 100 files. So GWToolset should be limited to X files per 25 minutes, for some value of X that ops are happy with.
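
To make the arithmetic concrete, here it is as a small Python sketch (illustration only; X is an input, and the other numbers are the ones quoted above, not anything from the actual config patch):

    def gwtoolset_rate_limit(x_parallel_requests,
                             page_size=50,              # files shown per special page
                             slowest_hour_uploads=240): # normal uploads, slowest hour
        # Keeping <= X/2 GWToolset files on each of the two special pages
        # means <= X GWToolset uploads per 2*page_size normal uploads.
        normal_per_minute = slowest_hour_uploads / 60          # ~4 files/min
        window_minutes = (2 * page_size) / normal_per_minute   # ~25 min for 100 files
        return x_parallel_requests / window_minutes            # uploads/min

    # With X = 10 (i.e. 5 files, 10% of a 50-entry page, on each special page):
    print(gwtoolset_rate_limit(10))  # 0.4/min, about one image every 2.5 minutes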

This is the best we can do with the current throttling options of the job queue, I think, but it has a lot of holes. The rate of normal uploads could temporarily drop well below the slowest-hour average. New file patrollers could be looking at the special pages with non-default settings (500 images instead of 50). Someone could look at the associated category (200 thumbnails at a time). This is not a problem if people are continuously keeping watch on Special:NewFiles, because that would mean the thumbnails get rendered soon after the uploads; but that's an untested assumption.
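
To quantify how much the non-default settings hurt, here is the same sketch with larger page sizes (again, illustrative numbers only):

    # Same arithmetic as above, varying the page size a patroller might use;
    # 240 normal uploads in the slowest hour, X = 10, as before.
    normal_per_minute = 240 / 60
    for page_size in (50, 200, 500):
        window_minutes = (2 * page_size) / normal_per_minute
        print(page_size, 10 / window_minutes)  # allowed GWToolset uploads/min

    # 50 -> 0.4/min, 200 -> 0.1/min, 500 -> 0.04/min: a patroller using
    # 500-entry pages would force a 10x lower upload rate than the default.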

So I am not confident that throttling would be enough to avoid further meltdowns. I think Dan is working on a patch to make the upload jobs pre-render the thumbnails; we might have to wait for that before allowing GWToolset uploads again.
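
For what it's worth, here is a hypothetical sketch of what pre-rendering in the upload job could look like (this is not Dan's actual patch; the helpers are made-up stubs). The point is that by the time the file can appear on Special:NewFiles, the common thumbnail sizes already exist, so page views no longer trigger a burst of parallel scaling requests:

    # Hypothetical sketch; the helpers are stubs standing in for the real
    # upload/scaling code.
    STANDARD_WIDTHS = [120, 320, 800]  # assumed commonly requested widths

    def store_original(file): pass           # stand-in: write original to Swift
    def render_thumbnail(file, width): pass  # stand-in: run the scaler once
    def publish(file): pass                  # stand-in: list on Special:NewFiles

    def upload_job(file):
        store_original(file)
        for width in STANDARD_WIDTHS:
            render_thumbnail(file, width)  # pre-render inside the job, serially
        publish(file)                      # only now can page views request thumbs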

== Swift bandwidth overuse ==

This seems simple: just limit the bandwidth available to a single transfer. If the throttling and pre-rendering are in place, they ensure that no more than a set number of scaling requests run in parallel; if bandwidth use is still an issue after that, come up with a per-transfer bandwidth limit such that even if the number of scaling requests maxes out, there is still enough bandwidth remaining to serve normal requests. (In the future, the bandwidth usage could be avoided completely by using the same server for uploading and thumbnail rendering, but that sounds like a more complex change.)
Gilles already started a thread about this ("Limiting bandwidth when reading from swift").
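
As a sketch of that arithmetic (all the capacities below are made up; I don't know the real Swift numbers):

    TOTAL_BANDWIDTH_MBPS = 1000  # assumed Swift read capacity
    RESERVED_FOR_NORMAL = 700    # assumed headroom to keep for everything else
    MAX_PARALLEL_SCALES = 10     # X from the throttling discussion above

    per_transfer_cap = (TOTAL_BANDWIDTH_MBPS - RESERVED_FOR_NORMAL) / MAX_PARALLEL_SCALES
    print(per_transfer_cap)  # 30 Mbps per scaler fetch; worst case 300 Mbps total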

== Job queue filling up ==

I am not sure if this is a real problem or just a symptom; does this cause any issues directly?
At any rate, this seems like a bug in the code of the GWToolset jobs, which have some logic to bail out if there are more than 1000 pending jobs, but apparently that does not work.
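
One guess at how such a check can fail (hypothetical code, not the actual GWToolset implementation): if the pending-job count is only checked once per batch, a large batch, or several producers racing past the check at the same time, can overshoot the cap by a lot:

    MAX_PENDING = 1000

    def enqueue_batch(queue, batch):
        # Hypothetical bail-out logic: the cap is checked once, up front...
        if len(queue) > MAX_PENDING:
            return False
        for job in batch:
            queue.append(job)  # ...so nothing stops this loop from pushing the
        return True            # queue far beyond MAX_PENDING once the check passes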

Any thoughts on this?