SUMMARY: This week I experienced an issue when uploading several hundred very high resolution maps as part of the NYPL maps project.[1] Discussion has been going on in several places, and this thread is an attempt to bring it together in one place so all users can benefit.
[Gilles, could you join this low-volume open email list to keep track of GWT issues and be a voice for WMF Operations, to help us reach a recommendation on end-user best practices?]
HISTORY Compared with our other GLAM projects, this upload was unusually stressful for the WMF servers. Individual map scans are up to 300 MB, and resolutions can exceed 80 megapixels (80 million pixels). There are 20,000 TIFF images to be uploaded; I have completed around 12%. I used the GWToolset at full capacity (20 threads), though I had broken the XML file up so that runs were a few hundred images at a time. My intention was to ramp this up to a couple of thousand per upload "tranche".
I was contacted on Tuesday by Operations, asking me to suspend the upload because the load from attempted thumbnail rendering of the TIFF images was too high for the WMF servers.[2] Over 500 of the TIFF images are greater than 50 megapixels, and as a consequence Commons fails to render any thumbnails for them (thumbnails are created for JPEGs above this limit; this is a TIFF-specific constraint).[3]
CURRENT STATE With no obvious immediate fix or workaround on the table from WMF Ops, I have proposed to restart my uploads for this project with an effective throttle, by using 2 threads (this is a setting on the first screen of the GWToolset). In practice, having tried a run of a couple of hundred, this means that the tool uploads the 100 MB images at a rate of 2 every 5 minutes. This does not seem to be causing any issues.
WAY FORWARD In the longer term the WMF is looking at alternatives for rendering TIFF thumbnails which would enable 50MP+ images to be handled; this may or may not help solve the problem seen this week.[4]
I recommend that the GWToolset on-wiki guides include advice on how to choose the number of processing threads based on the types of images to be uploaded. To date, no other project has seen these problems, probably because the image resolutions fall well under the 50MP threshold. The maximum allowed number of threads is 20, with a default of 10. For the time being I suggest we agree a best practice that upload projects with TIFFs over 50MP use no more than 2 threads; these problems do not appear to exist for projects uploading smaller-resolution files.
I propose that WMF Operations consider finding ways of testing the peak loads possible from the GWT, and decide whether this can be fixed by future operational improvements, whether the tool might benefit from some simple "load management" changes, or whether establishing a best practice for our (relatively) small number of GWT users would be a sufficient community-based control.
Links
1. https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps
2. https://commons.wikimedia.org/wiki/Commons_talk:Batch_uploading/NYPL_Maps
3. https://commons.wikimedia.org/wiki/Category:NYPL_maps_%28over_50_megapixels%...
4. https://bugzilla.wikimedia.org/show_bug.cgi?id=52045
Fae
On Fri, Apr 25, 2014 at 11:13 AM, Fæ faewik@gmail.com wrote:
With no obvious immediate fix or workaround on the table from WMF Ops, I have proposed to restart my uploads for this project with an effective throttle, by using 2 threads (this is a setting on the first screen of the GWToolset). In practice, having tried a run of a couple of hundred, this means that the tool uploads the 100 MB images at a rate of 2 every 5 minutes. This does not seem to be causing any issues.
The issue was not directly with the uploads; there is no thumbnail rendering happening on upload, so GWToolset adding lots of large TIFFs quickly would not cause problems in itself. The upload speed was problematic because that meant GWToolset saturated pages like Special:NewFiles, and when somebody looked at such pages, *that* triggered lots of thumbnail renderings of huge TIFF files at the same time. If GWToolset is slowed down and lots of miscellaneous files are uploaded between the TIFFs, those special pages won't be problematic, but something like a gallery or category of huge TIFF files could still be.
GWToolset already has several throttles in place (http://www.mediawiki.org/wiki/Extension:GWToolset/Technical_Design#Throttles...) that limit how many background uploads are picked up with each background job run, and how many GWToolset background jobs in total can exist in the entire job queue. On the beta cluster, the interval between GWToolset background job runs seemed to vary between 7 and 30 minutes. That seems like enough time for additional images to get processed in between GWToolset images.
wouldn’t it be better to throttle the application/tool that generates thumbnails so that it doesn’t try to produce too many thumbnails at once?
with kind regards, dan
On Mon, Apr 28, 2014 at 2:17 AM, dan entous d_entous@yahoo.com wrote:
wouldn't it be better to throttle the application/tool that generates thumbnails so that it doesn't try to produce too many thumbnails at once?
That would be the proper solution, yes. But it is not necessarily achievable in the short term.
Hi Dan,
wouldn’t it be better to throttle the application/tool that generates thumbnails so that it doesn’t try to produce too many thumbnails at once?
The issue is that there is no application generating thumbnails at a given rate. Thumbnails are generated on demand when people view a thumbnail that doesn't exist yet. And since Special:NewFiles exists, and is visited every few seconds by bots, all new uploads have their thumbnails generated almost on the spot. Thus, we can't slow down that part. We have several long-term tasks to improve this issue, but they will take months to implement. Our only option at the moment is to try and avoid having GWToolset make too many massive images appear on Commons' Special:NewFiles in a short period of time.
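To illustrate the on-demand model, here is a minimal sketch (the function names are hypothetical, not MediaWiki's actual scaler code):

function getThumbnail( $file, $width ) {
	// Nothing is rendered at upload time; the first request for a
	// missing thumbnail triggers the expensive scaling work.
	$thumbPath = thumbnailPath( $file, $width ); // hypothetical helper
	if ( !file_exists( $thumbPath ) ) {
		// For a 50MP+ TIFF this step can exceed the image scalers'
		// time limit, and failed attempts are currently retried.
		renderThumbnail( $file, $width, $thumbPath ); // hypothetical helper
	}
	return $thumbPath;
}

A page like Special:NewFiles requests up to 50 such thumbnails at once, which is what concentrates the load.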
Over 500 of the TIFF images were greater than 50 megapixels and as a consequence Commons fails to render any thumbnails
Indeed, it seems like some thumbnail generation requests timed out due to the size of these images. There are limits on how long a thumbnailing job can take on the image scalers, and these images were going over the limit. To make matters worse, the current retry mechanism means that they were being retried 5 times, and thus using 5 times the resources. I would advise against trying to upload those enormous images for now; we should focus on a solution for the smaller images. It would be great if the next upload attempt left the images that are too large aside.
I think the safest way to proceed is to lower the appropriate GWToolset throttles in production and then schedule a time for Fae to try the upload process again. By scheduling a specific day and time for the next attempt, we can make sure that engineers and Ops have eyes on the servers to watch the load. Then, if things go well, we can tweak the throttles back up to higher values.
http://www.mediawiki.org/wiki/Extension:GWToolset/Technical_Design#Throttles...
The throttle documentation doesn't give any units. I understand that it's "per background job run", but how often do these background jobs run?
I couldn't find configuration values for these throttles on Commons. Dan, can you confirm that Commons is using the default values?
On Apr 29, 2014, at 15:10 , Gilles Dubuc gilles@wikimedia.org wrote:
The throttle documentation doesn't have any unit. I understand that it's "per background job run", but how often do these background jobs run?
What do you mean by unit? Each config key in that section shows a default value to the right of it.
I couldn't find configuration values for these throttles on Commons. Dan, can you confirm that Commons is using the default values?
throttle config values
----------------------
The throttle configuration values are in the extension itself, in includes/Config.php (http://git.wikimedia.org/blob/mediawiki%2Fextensions%2FGWToolset.git/d27991c...), and can be overridden in the wmf-config/CommonSettings.php file (http://git.wikimedia.org/tree/operations%2Fmediawiki-config.git) in the if ( $wmgUseGWToolset ) { section.
The config values most likely to change would be $mediafile_job_throttle_default, which is currently set to 10, and $mediafile_job_throttle_max, which is currently set to 20.
At the moment, a user can set this throttle between 1 and 20. That means that every time a GWToolset Metadata background job runs, between 1 and 20 GWToolset Mediafile jobs are added to the queue. We could change those values, but that would be a pity for people uploading smaller file sizes. Hopefully, we could instead make it clear to the uploader that if their file sizes exceed X MB then they should set that throttle to 1, and make sure the engineers and Ops are notified in advance about the upload.
GWToolset\Config::$mediafile_job_throttle_default = new_value
GWToolset\Config::$mediafile_job_throttle_max = new_value
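For concreteness, a minimal sketch of what such an override in wmf-config/CommonSettings.php might look like, based on the description above (the values 5 and 10 are illustrative, not a recommendation):

if ( $wmgUseGWToolset ) {
	// ... existing GWToolset settings ...
	// Lower the per-run Mediafile job limits (shipped defaults: 10 and 20).
	GWToolset\Config::$mediafile_job_throttle_default = 5;
	GWToolset\Config::$mediafile_job_throttle_max = 10;
}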
job run frequency
-----------------
How often are the background jobs run? Is there a limit on how many GWToolset Mediafile background jobs are picked up at once?
I don't know; Aaron Schulz would be the best person to ask. On the beta cluster it seemed to vary between 7 and 30 minutes, but that may have been because of testing or other activity on that server.
What do you mean by unit? Each config key in that section shows a default value to the right of it.
I want to figure out how many background job runs we end up with per minute or per hour in practice. So I meant units such as X/minute or Y/hour. I know that it depends on how the background jobs are configured, but the throttle figures section of the documentation doesn't help figure that out. That makes it hard for anyone to pick a figure, because it's hard to know what the number represents.
Hopefully, we could instead make it clear to the uploader that if their file sizes exceed X MB then they should set that throttle to 1, and make sure the engineers and Ops are notified in advance about the upload.
Guidelines sound like a good idea. If I'm following this logic correctly, though, doesn't that mean there's also a risk that separate users might "step on each other's toes" in terms of resources if they happen to be uploading content at the same time? Basically, if a given user sets a threshold that is a fine value for isolated use, isn't the risk that the threshold ends up being too high when more than one GWToolset user is uploading to Commons at the same time? At first I thought that the limit was on the Commons server side, but your remark seems to suggest that this is configured on the uploader's side.
job run frequency
How often are the background jobs run? Is there a limit on how many GWToolset Mediafile background jobs are picked up at once?
I don't know; Aaron Schulz would be the best person to ask. On the beta cluster it seemed to vary between 7 and 30 minutes, but that may have been because of testing or other activity on that server.
CCing Aaron.
Up to 16 jobs among all GWT job types can be picked at once (1 runner per 16 servers).
Up to 16 jobs among all GWT job types can be picked at once (1 runner per 16 servers).
Are they picked up at a steady frequency?
Basically, if each job processes fewer images in one run, will that truly space out the handling of each image in time? Or will it have no effect, because each GWT job will complete faster, with the next GWT job being picked up right after the previous one on a given server completes?
If they are not "delayed" then yes, it won't do much to limit the files per job. Setting $wgJobBackoffThrottling might be useful here.
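As a minimal sketch of that suggestion (the unit is established later in this thread as jobs per second, and the value here is purely illustrative):

// Allow at most one GWToolset upload job to start every two minutes,
// cluster-wide (1/120 jobs per second).
$wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 1 / 120;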
-- -Aaron S
From a user perspective, it would be neat if we could set something like "high / medium / low" priority at the time of creating the upload request. Some GWT users might be trying to process their image collection in time for a planned edit-a-thon or similar, so taking a week to upload might be a problem (it is taking me this long to get through the NYPL maps uploads).
This would give operations more choices on how to handle medium/low priority requests, as it would be fair to intermittently pause these, or make them as slow as you like, during peak demand times. In fact it would make sense if a queue of "low" priority were limited to sharing the one processing thread.
All of this really only works if GWT can give the user some feedback on progress. At the moment the only way I know of doing this is to run a series of API queries checking whether everything on my list of images exists yet or not, which itself probably creates more stress on the servers.
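For what it's worth, the standard MediaWiki API accepts up to 50 titles per query, so batching keeps that polling fairly cheap; a minimal sketch (the file names are hypothetical):

// Check a batch of titles in one request; titles that do not exist
// yet come back with a "missing" flag in the result.
$titles = implode( '|', array( 'File:Example map 1.tiff', 'File:Example map 2.tiff' ) );
$url = 'https://commons.wikimedia.org/w/api.php?action=query&format=json&titles='
	. urlencode( $titles );
$result = json_decode( file_get_contents( $url ), true );
foreach ( $result['query']['pages'] as $page ) {
	echo $page['title'] . ( isset( $page['missing'] ) ? " - not uploaded yet\n" : " - exists\n" );
}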
Fae
On Apr 29, 2014 12:46 PM, "Fæ" faewik@gmail.com wrote:
is to run a series of API queries checking whether everything on my list of images exists yet or not, which itself probably creates more stress on the servers.
Totally different set of servers. So I think not cumulative.
-Jeremy
On Tue, Apr 29, 2014 at 7:19 AM, dan entous d_entous@yahoo.com wrote:
The config values most likely to change would be $mediafile_job_throttle_default, which is currently set to 10, and $mediafile_job_throttle_max, which is currently set to 20.
At the moment, a user can set this throttle between 1 and 20. That means that every time a GWToolset Metadata background job runs, between 1 and 20 GWToolset Mediafile jobs are added to the queue.
Just to verify that I am reading the code right:
- GWToolset has two main job types (and a third for the final cleanup, but that's not relevant now): UploadMetadataJob and UploadMediafileJob. The names are misleading: actually UploadMediafileJob does all the uploading, while UploadMetadataJob acts as a sort of daemon and spawns new UploadMediafileJob instances. Since they sound similar enough to be easy to mistake for each other when skimming the text, I'll just say "controller job" and "worker job" instead.
- When the user starts the batch upload, GWToolset creates a controller job. This job remains in existence until the upload ends. It handles delays by recreating itself: every time it is invoked, it reads some records from the XML, dispatches some workers, creates a clone of itself with an increased current-record counter, puts the clone at the end of the job queue and exits. The clone is scheduled to run no sooner than now() + $metadata_job_delay (1 minute).
- Every time the controller runs, it creates N worker instances, where N is a user-supplied parameter limited by $mediafile_job_throttle_max (20).
- The controller does not do anything (other than rescheduling itself) if there are more than $mediafile_job_queue_max (1000) workers. This is a global limit.
- Every worker instance handles the upload of a single file.
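If I am reading it right, the controller's behaviour could be sketched roughly like this (simplified pseudo-PHP; the helper functions are hypothetical, only the $-prefixed config names come from the extension):

// One run of the controller job (UploadMetadataJob), simplified.
function runControllerJob( array $records, $pos, $throttle, $queueMax ) {
	if ( countQueuedWorkerJobs() > $queueMax ) { // $mediafile_job_queue_max (1000)
		requeueControllerJob( $pos, '+1 minute' ); // reschedule only, do nothing else
		return;
	}
	// Dispatch up to $throttle workers (1-20), one file upload per worker.
	for ( $i = 0; $i < $throttle && $pos < count( $records ); $i++ ) {
		queueWorkerJob( $records[$pos++] );
	}
	if ( $pos < count( $records ) ) {
		// Clone itself with the advanced record counter; the clone runs
		// no sooner than now() + $metadata_job_delay (1 minute).
		requeueControllerJob( $pos, '+1 minute' );
	}
}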
Is the above correct? If so, that means there are currently at most 20 file uploads per user per minute (less if the user was nice and chose a lower limit). Right now, the upload frequency for normal (manually uploaded) files on Commons is about 5 files per minute (a very unscientific number, I just looked at the recent changes page), so a full-speed GWToolset run would fill about 80% of Special:NewFiles. Pulling a number from thin air, we could set this to 10% and see how well that works - which would mean one file per 2 minutes, and no more than 5 at a time (since Special:NewFiles shows 50 files). That would mean setting $mediafile_job_throttle_max to 10 and setting $metadata_job_delay to 20 minutes (since Dan said that in practice it can take significantly more time than $metadata_job_delay for the control job to refresh itself, increasing $metadata_job_delay is preferable to decreasing $mediafile_job_throttle_max; otherwise we might end up with a much lower speed than what we aimed for).
This would take care of throttling a single user, but multiple users can still flood the uploads. We could set $mediafile_job_queue_max to the same value as $mediafile_job_throttle_max, which would ensure that it is a global limit, but GWToolset jobs are designed to die if they run into $mediafile_job_queue_max several times, so that would cause lots of failed uploads. As far as I can see, GWToolset has no settings to deal with this problem.
The other alternative is to ignore GWToolset's own throttling settings and use $wgJobBackoffThrottling as Aaron suggested. That basically creates a global lock for 1 / $wgJobBackoffThrottling[$jobtype] minutes every time a job of the given type is run; so if we set it to 0.5, that would guarantee that GWToolset uploads never happen within two minutes of each other, regardless of any GWToolset settings. We would still want to change those settings to ensure that it does not send jobs to the queue significantly faster than they can be processed; otherwise large uploads would error out.
So we would set
\GWToolset\Config::$mediafile_job_throttle_max = 10;
\GWToolset\Config::$metadata_job_delay = '20 minutes';
$wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 0.5;
Does that sound reasonable?
On Tue, May 6, 2014 at 6:37 PM, Gergo Tisza gtisza@wikimedia.org wrote:
That would mean setting $mediafile_job_throttle_max to 10 and setting $metadata_job_delay to 20 minutes
5 and 10 minutes, of course, since we want to limit to 10% of the 50 items in the new files page.
$wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 0.5;
Uh, that's jobs per second, so it should be
$wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 30;
Also, throttling the controller job is probably better, as that way it will not fill up the queue and error out when multiple people use GWToolset at the same time.
So the correct version would be:
\GWToolset\Config::$mediafile_job_throttle_max = 10;
$wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] = 0.5;
(Since we are throttling the control job, not the workers, we don't need $metadata_job_delay which would be overridden by $wgJobBackoffThrottling anyway.)
Uhh... let's give this another shot in the morning.
I went through the last day's upload logs; on average there are ~600 uploads an hour, the peak was 1900 and the low point around 240. (The numbers are at http://pastebin.com/raw.php?i=wmBRJm1G in case anybody finds them useful.) So that's around 4 files per minute in the worst case.
If we are aiming for no more than 10% of Special:NewFiles to be taken up by GWToolset, that means 5 uploads per run of the control job (10% of the 50 slots at Special:NewFiles); the upload jobs can't really be throttled, so we must make sure they come in small enough chunks, no matter how much delay there is between the chunks. Also, we want to stay below 10% of the total Commons upload rate - that means 24 images per hour, which is roughly five runs of the control job per hour.
So the correct config is
GWToolset\Config::$mediafile_job_throttle_default = 5;
$wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] = 5 / 3600;
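(Spelling out the arithmetic: 5 / 3600 jobs per second is one control-job run every 720 seconds, i.e. five runs per hour; at 5 worker jobs dispatched per run that is 25 uploads per hour, just over the 24-per-hour target of 10% of the ~240 files/hour low point of the normal Commons upload rate.)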
I'm leaving the max throttle at 20 so that people who are uploading small, non-TIFF images can get a somewhat higher speed.
From the Ops & Multimedia mailing lists:
We just had a brief imagescaler outage today at approx. 11:20 UTC that was investigated and NYPL maps were found to be the cause of the outage.
As Gergo's unanswered recent message in this thread suggested, we're actively working on a number of changes to stabilize GWToolset and improve image scaler performance in order to avoid such outages. I assumed that since everyone involved is participating in this thread, you were waiting for these changes to happen before restarting the GWToolset job that caused the previous outage a couple of weeks ago, or that you would warn us when that job would be run again.

There seems to be a communication issue here. By running this job, you've taken down thumbnail generation on Commons (and all WMF wikis), and we were lucky that someone from Ops was around, noticed it and reacted quickly.

This could have been easily avoided with better coordination, by at least scheduling a time to run your next attempt, with people from Ops watching the servers at the time the job is run. Please make sure that this happens for the next batch of NYPL maps/massive files that you plan to upload with GWToolset. All it takes is scheduling a day and time for the next upload attempt.
Gergo and I will keep replying to this thread to notify everyone when our related code changes are merged.
Thanks for the update. I think the only GWT project that has caused the loading issue has been the NYPL maps upload, due to both the large file sizes and the tiff format used. I do have a project in the wings that has similar tiff sizes (100MB+)... some collections from the Library of Congress that I paused last month while I moved over to using the GWT. I have no plan to restart the LoC uploads in the next couple of weeks. The issue does not seem to be triggered by upload rate per se, as the NYPL maps were being uploaded at an average of around 1 file per minute.
I'll wait for a message from you on this list before attempting these, or I'll get in touch with you directly. I believe there are actually only around 300 images left to upload from the NYPL collection; there might be more later in the year if I can diagnose why around 40% do not seem to be available via their API/catalogue.
In the coming week I'm planning a Rijksmuseum upload; however, these are much smaller jpg files, so I do not believe there is any need to delay that project.[2]
Links 1. https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps 2. https://commons.wikimedia.org/wiki/Commons:Batch_uploading/Art_of_Japan_in_t...
Fae
first, i suggest that we put off all large image uploads, > 10mb ( unless we have a concrete value that would work ), until we resolve the thumbnail issue.
during the zürich hackathon i spoke with aaron schulz, faidon liambotis, and brion vibber regarding approaches to dealing with this issue. in summary, the idea aaron came up with is to create initial thumbnails on download of the original mediafile to the wiki. this should block the appearance of the title on the new files page and anywhere else until the thumbnails and title creation/edit have completed. aaron thought, and faidon and i agree, that further throttling of gwtoolset will not help resolve the issue.
i am currently looking into implementing this approach.
with kind regards, dan
Thanks Dan, sounds like a great plan for this specific situation.
On 12 May 2014 11:27, dan-nl dan.entous.wikimedia@gmail.com wrote:
If you can set up an illustrative example (maybe doing it "by hand") so we can see how the file history and so forth would look, then it might be easier to discuss on-wiki. In the case of the Library of Congress, their database has "webpage quality" jpgs available as well as larger jpgs and tiffs. It might be possible to pass the GWT an xml file with a link to a thumbnail image as well as the tiff rather than relying on automated generation somewhere else. The tricky part (I think) would be doing this for a mass tiff upload, which is actually the only example we have of stressing the WMF servers, as the initial file would have to be formatted as a tiff rather than a jpeg.
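As a purely illustrative sketch (GWToolset lets the uploader define the XML element names and their mapping, so everything below is a hypothetical record, not a required schema), such a record might carry both URLs:

<record>
  <title>Map of lower Manhattan (hypothetical)</title>
  <mediafile_url>http://example.org/maps/full_resolution.tiff</mediafile_url>
  <thumbnail_url>http://example.org/maps/web_quality.jpg</thumbnail_url>
</record>

The open question would then be whether the tool could be taught to use the second URL for thumbnailing instead of rendering from the TIFF.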
I agree, from what we have seen, this is not a simple throttling issue. I suspect even 1 file every 5 minutes could cause an issue if there is a backlog of thumbnail creation at peak times.
Let me know if you would like my last NYPL maps xml file to play around with as an example (it can be emailed as it is only 750k). These have yet to be uploaded and were the cause of the most recent problem.
Fae
Just a quick note that technical discussion about this is mostly happening on the multimedia list (recently, the "Brief image scalers outage, Mon Apr 21 03:12 UTC" and "Limiting bandwidth when reading from swift" threads).
Hi Gergo,
Was there a conclusion from discussions elsewhere?
I have some rather nice >100MB tiffs to upload from the Library of Congress, so would like to be able to run these through GWT without making more waves. :-)
Fae
Hi Fae,
I have some rather nice >100MB tiffs
How large is that batch?
We're still working on technical changes, nothing has been merged since the last outage.
On 20 May 2014 13:12, Gilles Dubuc gilles@wikimedia.org wrote:
How large is that batch?
It should be small, as it is the exception that is over 50MB, let alone 100MB. Some of it is tidying up where I skipped 100MB files previously (the 19thC. British Cartoons collection). I would *guess* no more than 100 or 200 in a day. I can actually choose my xml to limit the overall daily number if that is a concern and you would like to suggest a number. (Sidenote - preparing the xml metadata to discover which files to upload is slow due to LoC API limits of 15 requests per minute for "security" reasons - I was unaware of this until I contacted the LoC a couple of days ago. This is not a project that can be rushed through.)
I am happy to kick these off on 2 threads maximum, which should mean something like a maximum possible throughput rate for large files of less than c.500 in a day. 1 thread would presumably be half that.
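To spell out that estimate (illustrative arithmetic only, assuming the roughly 2 large files per 5 minutes observed with 2 threads earlier in this thread):

<?php
// Back-of-envelope throughput ceiling; the rate is an observed guess, not a spec.
$filesPerFiveMinutes = 2;                  // observed with 2 threads on 100MB files
$slotsPerDay = 24 * 60 / 5;                // 288 five-minute slots in a day
echo $filesPerFiveMinutes * $slotsPerDay;  // 576 - a ceiling; stalls and source-site
                                           // delays bring it nearer the c.500 guess

Halving the thread count would then roughly halve the ceiling to something like 288 a day.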
I will put aside the small number of remaining NYPL map files - there is no hurry, and it would be good to use these "trouble making" files to test out the technical changes when they are implemented.
PS Beta cluster is still not working for me today; I get the standard server-down page every time I run GWT there. I have not tried the production environment in the last week.
Fae
Let's start with the minimum (1 thread?), with images spread as far apart from each other as possible during the day, and see how it goes. We'll keep an eye on the server load every day and see if there's room for increasing the rate. Weekdays would be highly preferable for us.
dear all,
i have the following bug open to track this issue: https://bugzilla.wikimedia.org/show_bug.cgi?id=65217. unfortunately i’m already overbooked on other projects and won’t be able to address this fully for approximately 3 weeks; maybe sooner. hopefully, after that time the “fix” will work as expected, be merged into the master branch and deployed within 4 weeks ... or possibly the wmf operations and multimedia team may have a broader solution in place for thumbnail generation.
if any of you are able to take a look at the issue in regards to GWToolset and develop the patch that would be great. the basic concept for resolving the issue is to alter the mediafile job so that it creates the necessary thumbnails before creating the file page.
1. store the original media file in the upload stash
2. create the appropriate thumbnails
   a. determine the appropriate sizes
   b. create the largest first
   c. create any subsequent thumbnail sizes based on that first thumbnail
3. move the original and thumbnails into the appropriate image directories
4. create the file page
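a minimal sketch of that flow, in case it helps discussion - every function called below is hypothetical, standing in for the corresponding mediawiki/gwtoolset internals rather than naming real apis:

<?php
// Sketch only: hypothetical helpers mirroring steps 1-4 above.
function uploadMediafileWithThumbnails( $sourceUrl, $title, $wikitext ) {
    $stashKey = stashOriginal( $sourceUrl );           // 1. upload stash
    $sizes = determineThumbnailSizes( $stashKey );     // 2a. pick the sizes
    rsort( $sizes );                                   // 2b. largest first
    $largest = renderThumbFromOriginal( $stashKey, $sizes[0] );
    foreach ( array_slice( $sizes, 1 ) as $width ) {
        renderThumbFromThumb( $largest, $width );      // 2c. derive smaller sizes
    }
    publishOriginalAndThumbnails( $stashKey );         // 3. move into image dirs
    createFilePage( $title, $wikitext );               // 4. the title appears only now
}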
with kind regards, dan
dear all,
the wmf multimedia and operations teams have worked on several bugs that address this issue. i have summarised the bugs i know about in https://bugzilla.wikimedia.org/show_bug.cgi?id=65217.
as far as i understand it, it’s time to test another large file upload. if you have a batch of jpgs, tiffs, pdfs, ogvs or any other accepted format over 50mb (this is an arbitrary number, i have no idea where the threshold is; earlier i mentioned 10mb), please feel free to:
1. test the upload, or a part of it, on beta: http://commons.wikimedia.beta.wmflabs.org/wiki/Special:GWToolset
2. if all runs well, coordinate with #wikimedia-operations before running the batch on production, so that they can monitor its progress. i’m not sure how much lead time they will need, so i suggest giving them at least a day’s notice and then a reminder an hour right before you run the upload.
with kind regards, dan
Hi all,
with bug 65691 https://bugzilla.wikimedia.org/show_bug.cgi?id=65691 fixed (the last patch was deployed today), now might be a good time to test large TIFF uploads again (the patch is limited to TIFF files for now). I was thinking of the following schedule:
- wait until Monday (no breaking the site on the weekends)
- launch an upload with large TIFF files (preferably the same one that caused issues earlier, i.e. Fae's NYPL maps project)
- make sure that the images are initially not categorized, to avoid someone triggering 200 new thumbnail requests in parallel (GWToolset could add an empty template instead, and that template can be replaced with the category later)
- initially use the minimum speed allowed by GWToolset (a single thread), to make sure Special:NewFiles and co. will also not be the source of many concurrent requests
- after a bunch of images have been uploaded, generate a gallery with 10 thumbnails and monitor imagescaler load and Swift traffic in the process; repeat with 20, 50 etc. until we are satisfied that the scalers are resilient to many concurrent requests for large files
- if all works out, the upload project can continue with normal speed (20 threads or whatever), and we can also relax throttle limits on GWToolset a bit
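For the gallery step, a plain wikitext gallery on any test page would do (the file names below are placeholders):

<gallery>
File:NYPL map test 01.tiff
File:NYPL map test 02.tiff
File:NYPL map test 03.tiff
</gallery>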
Does this sound reasonable? Fae, are you interested in doing this?
On 10/07/2014, Gergo Tisza gtisza@wikimedia.org wrote:
Two test sets are suggested below. Would these lists of TIFFs that still need thumbnails rendered be enough? I'm hesitant to commit to putting together a set of brand new files to test myself in the next few days, as there are several other things I need to get on with (as well as "RL" stuff); for example, I ought to test out a sample of the Wellcome uploads I have been sent (the disk will be ready next week if the WMF give me details of where to send it). Unfortunately the Library of Congress TIFFs that I am in the middle of putting through GWT do not tell me their resolution before I upload them - this makes it impossible for me to suggest a set of 50MP+ files to upload from scratch. The next couple on my backlog after the HABS collection (Wellcome and Rijksmuseum) are all jpegs rather than tiffs.
SET ONE (HABS)
I have been uploading many TIFFs from the Library of Congress, many well over 130 MP; however, they are much smaller in filesize than the NYPL collections. For example:
* https://commons.wikimedia.org/wiki/File:Eastburn-Jeanes_Limekilns,_On_Paperm...
These have yet to be rendered with thumbnails, and there are over 4,000 of them, which you can find listed at:
* https://commons.wikimedia.org/wiki/Category:Uploads_by_F%C3%A6_%28over_50_MP...
They might be good as a speed test to blast through and to create the thumbnails. I don't know if this means overwriting the current files or if there is some trick to forcing the thumbnail recreation. If these work well, then I'll stop creating PNGs for the TIFF drawings I'm uploading (there are going to be something like another 20,000+ of these to come as part of HABS uploading).
SET TWO (NYPL)
The un-rendered NYPL maps collection numbers around 1,100 and is at: https://commons.wikimedia.org/wiki/Category:NYPL_maps_%28over_50_megapixels%...
These are both large in resolution and many are *very* large in filesize.
Fae
Doing ?action=purge on an image description page will force all thumbnails to be re-rendered.
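For reference, such a purge looks like this (the file name is a placeholder):

https://commons.wikimedia.org/w/index.php?title=File:Example_NYPL_map.tif&action=purge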
However I don't see much point in doing that. I'd much prefer people concentrate on doing new useful work instead of artificial "tests".
--bawolff
On Thu, Jul 10, 2014 at 5:13 PM, Brian Wolff bawolff@gmail.com wrote:
However I don't see much point in doing that. I'd much prefer people concentrate on doing new useful work instead of artificial "tests".
Currently the use of GWToolset is limited because of operational problems in the past; lifting that limitation *is* useful work.
If ops are fine with re-enabling it without any testing, I have no problem with that. Would you like to propose it to them? :)
Alternatively, we can just wait until the other improvements are done so that less testing is needed. Again, if Fae and other people involved in GLAM uploads are OK with keeping the current limitations for some more weeks, fine with me.
On 10/07/2014, Fæ faewik@gmail.com wrote:
or if there is some trick to forcing the thumbnail recreation. If these work well, then I'll stop creating PNGs for the TIFF drawings
Ah, rereading the bugzilla ticket (oops, I should have paid more attention) shows that this will not fix the thumbnail creation issue, just reduce the stress on the system when TIFFs appear that are too large. My PNG creation routine will continue to churn away. :-)
Fae
https://commons.wikimedia.org/wiki/Category:NYPL_maps_%28over_50_megapixels%... https://commons.wikimedia.org/wiki/Category:Uploads_by_F%C3%A6_%28over_50_MP...
these only contain >50MP images, and if I understand correctly, thumbnail requests for those are currently refused, so they can't generate any load. (They used to be refused only after sending the original through Swift, and that might have contributed to the scaler outage, but Bawolff fixed that in 135101 https://gerrit.wikimedia.org/r/#/c/135101/.)
I would suggest continuing the NYPL upload; no problem if most of the images are small, I can just filter out the large ones and build a gallery out of them. Or if you don't have time for that, and don't plan on working in the near future on any uploads that are affected, we can just do the whole thing another time.
On Thu, Jul 10, 2014 at 3:08 PM, Gergo Tisza gtisza@wikimedia.org wrote:
For the record, this is on hold because no one seems interested in doing mass TIFF uploads at the moment. Whenever someone plans doing that, ping me or anyone from the Multimedia team and we can get back to this.
No worries. I will have time to set something up in a couple of weeks.
The Wellcome uploads will be a last minute priority before Wikimania, so my focus will probably be on that.
On 7 May 2014 02:37, Gergo Tisza gtisza@wikimedia.org wrote: ...
files on Commons is about 5 files per minute (very unscientific number, I just looked at the recent changes page), so a full-speed GWToolset run would
...
The consequent estimates appear to be off here. In practice, I have run 5 threads using GWT and the results averaged 1 file per minute (i.e. 5 files every 5 minutes). These were 50MB to 110MB sized files, so waiting on the source website might add to the delays.
Fae
On Wed, May 7, 2014 at 12:32 AM, Fæ faewik@gmail.com wrote:
In theory it shouldn't: the scheduling and the uploads are done by two different jobs. The scheduler job reschedules itself for 1 minute later as soon as it starts, then adds (# of threads) new upload jobs. Those jobs taking a long time does not delay the scheduler at all. Of course, just because a job schedules itself to run in a minute, there is no guarantee it is actually picked up in that timeframe, so this might just depend on how congested the job queue is. We'll have to try and see.
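As a sketch of that pattern - the class names and constructor signatures below are hypothetical stand-ins for the GWToolset jobs, though JobQueueGroup::singleton()->push() and the jobReleaseTimestamp job parameter are, to my knowledge, real MediaWiki mechanisms:

<?php
// Sketch only: illustrates the self-rescheduling control job described above.
class SketchMetadataJob extends Job {
    public function run() {
        // Reschedule the control job first, so slow uploads cannot delay it;
        // jobReleaseTimestamp asks the queue to hold the job for ~1 minute.
        JobQueueGroup::singleton()->push( new SketchMetadataJob(
            $this->title,
            [ 'jobReleaseTimestamp' => time() + 60 ] + $this->params
        ) );
        // Then enqueue one upload job per configured "thread".
        for ( $i = 0; $i < $this->params['throttle']; $i++ ) {
            JobQueueGroup::singleton()->push(
                new SketchMediafileJob( $this->title, $this->params )
            );
        }
        return true;
    }
}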