Hi folks,

I’m happy to let you know that in today’s sprint planning meeting, the multimedia team agreed to take on this issue as our top priority this week.

Gilles will lead this investigation and report back to us as we identify practical solutions to this complex problem.

We pushed back some feature development on Media Viewer to make this happen, but are still working on the same release schedule for now.

Many thanks to everyone who helped address this emergency over the weekend — and for all your good advice on how to solve this issue!


Fabrice


On Apr 22, 2014, at 10:26 AM, Fabrice Florin <fflorin@wikimedia.org> wrote:

Dear Faidon,

Your point is well taken that a major outage should trump a feature release.

We will discuss this issue with the multimedia team in tomorrow’s sprint planning meeting and see if we can take it on right away. If we do, this could push back our release of Media Viewer in the coming weeks.

For now, I have filed this high-priority 'spike' ticket for evaluation by our team. We will respond here and onwiki once our team has had a chance to investigate possible solutions.

https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/482

Thanks again to everyone on your team for taking on this issue!

Regards as ever,


Fabrice

______________________________________

#482 Investigate solution for image scalers outage

Narrative
As a power user, I can upload large TIFF image files using GWToolset, so that others can view them without crashing the system.

Investigate possible solutions for the image scalers outage that took place over Easter weekend. 

User:Fæ uploaded hundreds of 100-200MB multipage TIFFs via GWToolset over the course of 4-5 hours (multiple files per minute). Random
users/bots then viewed Special:NewFiles, which attempts to display a thumbnail for all of those new files in parallel in realtime, thus
saturating the imagescalers' MaxClients setting and inadvertently DoSing them, as reported by Faidon.

The outage's symptoms seem to have been alleviated since, but the Commons/GLAM communities are waiting for a response from us to resume
their work. They've responded to our "pause" request and in turn requested our feedback at:
https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps

...which is also linked from:
https://commons.wikimedia.org/wiki/User_talk:F%C3%A6#Large_file_uploads
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F

It appears this project is establishing the limits of what Commons can currently handle, and we invite ideas on how the strain on the servers for the large images involved can be reduced. 

According to Emmanuel, this problem might be fixed by using the VipsScaler for TIFF pictures. Greg has already worked on this and proposed a patch, but the patch has been awaiting review for 7 months:
https://bugzilla.wikimedia.org/show_bug.cgi?id=52045



On Apr 22, 2014, at 4:31 AM, Faidon Liambotis <faidon@wikimedia.org> wrote:

Fabrice,

I don't see how a feature release can be of a higher priority than
troubleshooting an outage but regardless:

The outage's symptoms seem to have been alleviated since, but the
Commons/GLAM communities are waiting for a response from us to resume
their work. They've responded to our "pause" request and in turn
requested our feedback at:
https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps
(see the large red banner with the stop sign at the bottom)

...which is also linked from:
https://commons.wikimedia.org/wiki/User_talk:F%C3%A6#Large_file_uploads
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F

Sadly, I don't have much to offer them, as I previously explained. I
certainly wouldn't commit to anything considering your response on the
matter.

Could you communicate your team's priorities to Fæ and the rest of the
Commons/GLAM community directly?

Thanks,
Faidon

On Mon, Apr 21, 2014 at 08:57:39AM -0700, Fabrice Florin wrote:
Dear Faidon, Emmanuel and Giuseppe,

Thanks so much for investigating this issue so quickly and sharing the likely cause of the problem with us.

This quarter, our team’s top priority is to address serious issues related to Upload Wizard — and this seems like a good one for us to take on.

However, we are still in the process of releasing Media Viewer, which is likely to take most of our attention for the next few weeks.

So we may not be able to troubleshoot it right away. But we are filing tickets about this issue, so we can hit the ground running in early May.

Thanks again for your fine work, as well as for your patience and understanding.


Fabrice


On Apr 21, 2014, at 3:53 AM, Emmanuel Engelhart <emmanuel.engelhart@wikimedia.ch> wrote:

On 21.04.2014 12:05, Faidon Liambotis wrote:
On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
The problem resolved before I could get to strace the apache processes, so
I don't have more details - Faidon was investigating as well and may have
more info.

Indeed, I do: this had nothing to do with TMH. The trigger was Commons
User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset
over the course of 4-5 hours (multiple files per minute), and then
random users/bots viewing Special:NewFiles, which attempts to display a
thumbnail for all of those new files in parallel in realtime, and thus
saturating imagescalers' MaxClients setting and basically inadvertently
DoSing them.

The issue was temporary because of
https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user
kept uploading new files, it was recurrent, with different files every
time. Essentially, we would keep having short outages every now and then
for as long as the upload activity continued.

I left a comment over at https://commons.wikimedia.org/wiki/User_talk:Fæ
and contacted Commons admins over at #wikimedia-commons, as a courtesy
to both before I used my root to elevate my privileges and ban a
long-time prominent Wikimedia user as an emergency countermeasure :)

It was effective, as Fæ immediately responded and ceased the activity
until further discussion; the Commons community was also helpful in the
short discussion that followed.

Andre also pointed out that Fæ had previously begun the "Images so big
they break Commons" thread at the Commons Village Pump:
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F

As for the more permanent solution: there's not much we, as ops, can do
about this but say "no, don't upload all these files", which is
obviously not a great solution :) The root cause is an architecture
issue with how imagescalers behave with regard to resource-intensive
jobs coming in a short period of time. Perhaps a combination of
poolcounter per file and more capacity (servers) would alleviate the
effect, but ideally we should be able to have some grouping &
prioritization of imagescaling jobs so that large jobs can't completely
saturate and DoS the cluster.
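
To make the "poolcounter per file" plus "grouping & prioritization" idea
above concrete, here is a minimal sketch of admission control for scaling
jobs. It is an illustration only, not the actual PoolCounter service or any
MediaWiki API: the class name, method names and limits are all hypothetical.
It allows at most one job per file and caps how many large jobs may run at
once, so that some workers always remain free for small jobs.

```python
import threading


class ScalerAdmission:
    """Hypothetical admission control for thumbnail scaling jobs."""

    def __init__(self, total_workers, max_large_jobs, max_per_file=1):
        self._lock = threading.Lock()
        self._total_workers = total_workers    # analogous to MaxClients
        self._max_large_jobs = max_large_jobs  # cap for resource-intensive jobs
        self._max_per_file = max_per_file      # per-file concurrency limit
        self._running = 0
        self._running_large = 0
        self._per_file = {}

    def try_acquire(self, file_key, is_large):
        """Return True if the job may start now, False if it must wait."""
        with self._lock:
            if self._running >= self._total_workers:
                return False
            if self._per_file.get(file_key, 0) >= self._max_per_file:
                return False
            if is_large and self._running_large >= self._max_large_jobs:
                return False
            self._running += 1
            self._per_file[file_key] = self._per_file.get(file_key, 0) + 1
            if is_large:
                self._running_large += 1
            return True

    def release(self, file_key, is_large):
        """Free the slot taken by a job that try_acquire admitted earlier."""
        with self._lock:
            self._running -= 1
            self._per_file[file_key] -= 1
            if self._per_file[file_key] == 0:
                del self._per_file[file_key]
            if is_large:
                self._running_large -= 1
```

With such a scheme, a burst of requests for many large TIFFs would be
refused or queued once the large-job cap is reached, instead of occupying
every worker. What to do with a refused request (queue it, retry later, or
return an error to the client) is left open here.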

Commons has great difficulty dealing with big TIFF files, and this is a serious issue, in particular for Wikipedians in Residence. To me it looks like using the VipsScaler would help to fix the worst cases.

Here is an email I have sent to Andre and Greg a few days ago. I make it public with the hope it might help.

===========
As a GLAM volunteer and WIR at the Swiss National Library, I encourage institutions to upload high quality pictures to increase digital sustainability. But, in the worst case (big TIFF files), Commons is not able to deal with them and fails to compute the thumbnails.

You have a perfect example of the problem with this recently uploaded collection of historical plans of the Zurich main station:
https://commons.wikimedia.org/wiki/Category:Historical_plans_of_Zurich_Main_Station

It seems that this problem might be fixed by using the VipsScaler for TIFF pictures. Greg has already worked on this and proposed a patch, but the patch has been awaiting review for 7 months:
https://bugzilla.wikimedia.org/show_bug.cgi?id=52045

IMO it would be great if you could do something to increase the priority and the urgency of this ticket. The movement invests considerable resources in building successful collaborations with GLAMs, and many of them are held back by this "silly" bug.

Hope you can help us.
===========

--
Volunteer
Technology, GLAM, Trainings
Zurich
+41 797 670 398

_______________________________________________
Multimedia mailing list
Multimedia@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia

_______________________________

Fabrice Florin
Product Manager
Wikimedia Foundation

http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)




_______________________________________________
Ops mailing list
Ops@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops

