I've opened a phab task https://phabricator.wikimedia.org/T133448 about writing up good intro docs and updating other docs to match them.
Feel free, y'all, to add to that or hang additional tasks onto it, like better utility classes to help folks transition code to background jobs... and maybe infrastructure to make sure we're handling those jobs reliably on small sites without dedicated job runners.
-- brion

On Apr 21, 2016 5:26 PM, "Brion Vibber" bvibber@wikimedia.org wrote:
On Thu, Apr 21, 2016 at 4:59 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
On Apr 20, 2016 10:45 PM, "Brion Vibber" bvibber@wikimedia.org wrote:
Note that we could fire off a job queue background task to do the actual removal... But is it also safe to do that on a read-only request?
https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacen...
seems to indicate job queueing will be safe, but would like to confirm that. :)
I think this is the preferred method. My understanding is that the jobs will get shipped to the primary DC job queue.
*nod* looks like per spec that should work with few surprises.
Similarly, in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to trigger missing transcodes to run on demand. The actual re-encoding happens in a background job, but we have to fire it off, and we have to record that we fired it off so we don't duplicate it...
[snip]
The job queue can do deduplication, although you would have to check whether it still applies while the job is running and not only while it is queued. Might help?
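
To make the queued-versus-running distinction concrete, here is a minimal sketch of a deduplicating queue. It is a toy in-memory model, not MediaWiki's job queue: the `DedupQueue` class, its `dedupe_running` flag, and the method names are all hypothetical and exist only for illustration.

```python
from collections import deque


class DedupQueue:
    """Toy in-memory queue (hypothetical; not the MediaWiki JobQueue API)."""

    def __init__(self, dedupe_running=False):
        self.pending = deque()       # jobs waiting to run, in order
        self.pending_keys = set()    # keys of jobs currently waiting
        self.running_keys = set()    # keys of jobs popped but not finished
        self.dedupe_running = dedupe_running

    @staticmethod
    def _key(job_type, params):
        # Two jobs are duplicates if type and parameters match exactly.
        return (job_type, tuple(sorted(params.items())))

    def push(self, job_type, params):
        """Queue a job; return False if it was dropped as a duplicate."""
        key = self._key(job_type, params)
        if key in self.pending_keys:
            return False
        if self.dedupe_running and key in self.running_keys:
            return False
        self.pending.append(key)
        self.pending_keys.add(key)
        return True

    def pop(self):
        """Take the next job; it counts as running until finish() is called."""
        key = self.pending.popleft()
        self.pending_keys.discard(key)
        self.running_keys.add(key)
        return key

    def finish(self, key):
        self.running_keys.discard(key)
```

With `dedupe_running=False`, a second push of the same job is dropped while the first is still waiting, but is accepted again once the first has been popped and is running. That is the gap the question above is asking about.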
Part of the trick is we want to let the user know that the job has been queued; and if the job errors out, we want the user to know that the job errored out.
Currently this means we have to update a row in the 'transcode' table (TimedMediaHandler-specific info about the transcoded derivative files) when we fire off the job, then update its state again when the job actually runs.
If that's split into two queues, one lightweight and one heavyweight, then this might make sense:
- N web requests hit something using File:Foobar.webm, which has a missing transcode
- they each try to queue up a job to the lightweight queue that says "start queueing this to actually transcode!"
- when the job queue runner on the lightweight queue sees the first such job, it records the status update to the database and queues up a heavyweight job to run the actual transcoding. The N-1 remaining jobs duped on the same title/params either get removed, or never got stored in the first place; I forget how it works. :)
- ... time passes, during which further web requests don't yet see the updated database table state, and keep queueing in the lightweight queue.
- lightweight queue runners see some of those jobs, but they have the updated master database state and know they don't need to act.
- database replication of the updated state hits the remote DC
- ... time passes, during which further web requests see the updated database table state and don't bother queueing the lightweight job
- eventually, the heavyweight job runs, completes, updates the states at start and end.
- eventually, the database replicates the transcode state completion to the remote DC.
- web requests start seeing the completed state, and their output includes the updated transcode information.
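
The flow in the list above can be sketched as a small simulation. This is only a toy model under stated assumptions: the `Site` class, the state names, and the explicit `replicate()` step are hypothetical stand-ins for the 'transcode' table, the two job queues, and cross-DC replication, and none of it is TimedMediaHandler or MediaWiki code.

```python
from collections import deque

# Hypothetical state names for a row in the 'transcode' table.
MISSING, QUEUED, RUNNING, DONE, ERROR = "missing", "queued", "running", "done", "error"


class Site:
    """Toy model: master DB, lagging replica, lightweight and heavyweight queues."""

    def __init__(self):
        self.master = {}         # state as seen by job runners
        self.replica = {}        # lagging state as seen by web requests
        self.light = deque()     # "please start queueing this" jobs
        self.light_keys = set()  # dedup of lightweight jobs while queued
        self.heavy = deque()     # actual transcode jobs

    def web_request(self, title):
        # Read-only request: consult the replica, at most queue a light job.
        state = self.replica.get(title, MISSING)
        if state == MISSING and title not in self.light_keys:
            self.light.append(title)
            self.light_keys.add(title)
        return state

    def run_light(self):
        title = self.light.popleft()
        self.light_keys.discard(title)
        # The runner sees master state, so late duplicates become no-ops.
        if self.master.get(title, MISSING) == MISSING:
            self.master[title] = QUEUED
            self.heavy.append(title)

    def run_heavy(self, transcode):
        title = self.heavy.popleft()
        self.master[title] = RUNNING
        try:
            transcode(title)  # caller-supplied stand-in for the real work
            self.master[title] = DONE
        except Exception:
            # Record the failure so a later request can report it to the user.
            self.master[title] = ERROR

    def replicate(self):
        # Stand-in for database replication reaching the remote DC.
        self.replica = dict(self.master)
```

Calling `web_request` several times before `replicate()` queues at most one lightweight job at a time, and a lightweight job that runs after the state is already recorded on the master does nothing, which matches the "know they don't need to act" step.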
It all feels a bit complex, and I wonder if we could build some common classes to help with this transaction model. I'm pretty sure we can be making more use of background jobs outside of TimedMediaHandler's slow video format conversions. :D
-- brion