This is a simple-sounding question, but I have two uploads going on in parallel right now, one using 8 processing threads and the other using 16, so 24 in total. None of these files is huge; they seem to be under 15 MB, with an occasional outlier around 45 MB (though quite a few drawing scans break the TIFF maximum size barrier of 50 MP, even though they are only a minuscule ~2.5 MB in file size).
GWT was designed for a maximum of 20 threads, and I don't know whether to feel guilty about running 24 threads this way, even though these uploads are unlikely to break anything.
Any thoughts? If what I'm doing is somehow self-regulating, I would be tempted to add another job and bump the "volume" to 40 or more threads, as this particular upload has over 100,000 images (potentially 200,000) and I'd rather it didn't take over a month to complete (which is what it looks like right now at a rate of 2,800 images per day).
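For reference, a back-of-the-envelope sketch of that time estimate (the rate and totals are just the figures above, nothing measured):

```python
# Time-to-completion estimate for the batch sizes mentioned above.
# The rate and totals are the figures from this thread, not measurements.
RATE_PER_DAY = 2_800  # observed upload rate, images/day

for total in (100_000, 200_000):
    days = total / RATE_PER_DAY
    print(f"{total:,} images at {RATE_PER_DAY:,}/day -> "
          f"{days:.0f} days (~{days / 30:.1f} months)")
```

So roughly five weeks for 100,000 images, and well over two months for 200,000.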
Fae
Hi Fae,
On Mon, Jun 30, 2014 at 2:38 PM, Fæ <faewik@gmail.com> wrote:
> This is a simple-sounding question, but I have two uploads going on in parallel right now, one using 8 processing threads and the other using 16, so 24 in total. None of these files is huge; they seem to be under 15 MB, with an occasional outlier around 45 MB (though quite a few drawing scans break the TIFF maximum size barrier of 50 MP, even though they are only a minuscule ~2.5 MB in file size).
> GWT was designed for a maximum of 20 threads, and I don't know whether to feel guilty about running 24 threads this way, even though these uploads are unlikely to break anything.
The recent outages were not directly related to upload volume. Uploads do not (yet) cause thumbnails to be rendered; the thumbnail requests (which overloaded the image scaling servers) were caused by people looking at those images (maybe on Special:NewFiles, or some category page). So it is really the "image view volume" that counts; there is some relation to the upload speed (more uploads -> more images on Special:NewFiles -> more views), but it's rather indirect.
The upload volume in itself is tiny; you mention 3K uploads of 15 Mb images per day, which consumes about 0.5 Mbps of bandwidth, while the capacity is in the gigabit range. As long as the images are small and creating thumbnails for them is not particularly processing-intensive, I don't think lots of threads would be problematic. However...
> Any thoughts? If what I'm doing is somehow self-regulating, I would be tempted to add another job and bump the "volume" to 40 or more threads, as this particular upload has over 100,000 images (potentially 200,000) and I'd rather it didn't take over a month to complete (which is what it looks like right now at a rate of 2,800 images per day).
...exactly because GWToolset has self-regulating limits, there is not much point in doing that. The job throttling has recently been changed to be global instead of per-user or per-process, so adding more jobs will not speed things up. Actually, running one upload with 20 threads should be faster than running two with 24 (as the number of threads is currently ignored, but the number of uploads is factored into the throttling). (Note that this is a recent change and not yet tested in the real world, so take it with a grain of salt; see https://gerrit.wikimedia.org/r/#/c/132112/ for details.)
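As an illustration (a toy model, not GWToolset's actual code; the names and numbers below are invented), a global throttle means every extra job just splits the same fixed budget:

```python
# Toy model of a *global* upload throttle (invented figures, not GWToolset
# internals): all running jobs share one budget, so total throughput stays
# flat no matter how many jobs are started.
GLOBAL_UPLOADS_PER_HOUR = 120  # hypothetical global budget

def per_job_rate(active_jobs: int) -> float:
    """Each job gets an equal slice of the single global budget."""
    return GLOBAL_UPLOADS_PER_HOUR / active_jobs

for jobs in (1, 2, 4):
    each = per_job_rate(jobs)
    print(f"{jobs} job(s): {each:.0f} uploads/h each, {each * jobs:.0f}/h total")
```

The total is the same in every row, which is why one job with 20 threads beats two jobs with 24 under the new throttling.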
A more direct way of throttling thumbnail generation will be deployed next Thursday (bug 65691: https://bugzilla.wikimedia.org/show_bug.cgi?id=65691); we might want to reconsider the GWToolset throttling limits after that.
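For completeness, the bandwidth figure above is easy to verify, taking the 3K-uploads-per-day and ~15 Mb-per-image numbers at face value:

```python
# Sanity check of the "about 0.5 Mbps" estimate above, using the
# 3K uploads/day and ~15 Mb-per-image figures from this thread.
uploads_per_day = 3_000
megabits_per_file = 15
seconds_per_day = 24 * 60 * 60

mbps = uploads_per_day * megabits_per_file / seconds_per_day
print(f"sustained upload bandwidth: {mbps:.2f} Mbps")  # ~0.52 Mbps
```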
On 01/07/2014, Gergo Tisza <gtisza@wikimedia.org> wrote:
>> Any thoughts? If what I'm doing is somehow self-regulating, I would be tempted to add another job and bump the "volume" to 40 or more threads, as this particular upload has over 100,000 images (potentially 200,000) and I'd rather it didn't take over a month to complete (which is what it looks like right now at a rate of 2,800 images per day).
> ...exactly because GWToolset has self-regulating limits, there is not much point in doing that. The job throttling has recently been changed to be global instead of per-user or per-process, so adding more jobs will not speed things up. Actually, running one upload with 20 threads should be faster than running two with 24 (as the number of threads is currently ignored, but the number of uploads is factored into the throttling).
Ah, good to know. I am currently running exactly this way: one job with 20 threads.
> we might want to reconsider the GWToolset throttling limits after that.
Good. Though it's working right now, throttling the tool so that all of us users combined can only upload 100,000 images (or the equivalent) in a month, i.e. roughly 1.2 million in a year, looks low in the long term. In fact, I'd probably end up annoying all the other users by hogging the capacity for myself if it stays the same.
To put this in context, the HABS project at the Library of Congress alone is supposed to have a quarter of a million images, and that is just one of many archives at the LoC. I have also previously discussed a UK GLAM with several million potential images. GWT would seem more than a little defective if we had to say it would take several years to upload that many. :-)
Fae