Over in the TimedMediaHandler extension, we've had a number of cases where old code conveniently squished read-write operations into data getters. Those hacks got removed because of problems with long-running transactions, or because of refactoring for future-facing multi-DC work, where we want requests to more reliably distinguish between read-only and read-write. And we sometimes want to put some of those clever hacks back, and add more. ;)
For instance, in https://gerrit.wikimedia.org/r/284368 we'd like to automatically remove transcode derivative files for types/resolutions that have been disabled, whenever we come across them. But I'm a bit unsure whether it's safe to do so.
Note that we could fire off a job queue background task to do the actual removal... But is it also safe to do that on a read-only request? https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacen... seems to indicate job queueing will be safe, but I'd like to confirm that. :)
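To illustrate the pattern (just a sketch, not the actual patch -- the 'removeDisabledTranscodes' job type name is made up, though JobQueueGroup and JobSpecification are the real core classes), the read request would only push a job instead of touching the master DB:

$job = new JobSpecification(
	'removeDisabledTranscodes',          // hypothetical job type name
	[ 'transcodeKey' => $transcodeKey ], // enough info for the runner to do the delete
	[],
	$file->getTitle()
);
// No master DB write happens on the web request; the job runs later on a
// job runner and, per the RfC, should get routed to the primary DC's queue.
JobQueueGroup::singleton()->push( $job );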
Similarly, in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to trigger missing transcodes to run on demand. The actual re-encoding happens in a background job, but we have to fire it off, and we have to record that we fired it off so we don't duplicate it...
(This would require a second queue to do the high-priority state table update and then queue the actual transcoding job; we can't put them in one queue, because a backlog of transcode jobs would prevent the high-priority job from running in a timely fashion.)
A best practices document on future-proofing for multi-DC would be pretty awesome! Maybe factor out some of the stuff from the RfC into a nice dev doc page...
-- brion
On Thu, Apr 21, 2016 at 1:45 AM, Brion Vibber bvibber@wikimedia.org wrote:
[snip]
When doing something like that from a read request, there's also the problem that a popular page might get lots of views (maybe thousands, if the queue is a little backed up) before the job is processed. So if a view triggers the job, and views only stop inserting it after the job has actually been executed, this might cause a large number of useless jobs to be enqueued until one of them finally runs.
-- -bawolff
On Thursday, April 21, 2016, bawolff bawolff+wn@gmail.com wrote:
[snip]
Can PoolCounter help with that? Docs are a little sparse, but it was put in place to prevent exactly that sort of pile-up, where lots of requests try to run the same work over and over simultaneously.
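Roughly what I'm imagining (a sketch only -- the pool type and key are made up, and 'TMHTranscodeEnqueue' would need its own entry in $wgPoolCounterConf; PoolCounterWorkViaCallback is the real class):

// $jobSpec would be a JobSpecification like in the earlier sketch.
$work = new PoolCounterWorkViaCallback(
	'TMHTranscodeEnqueue',                                // hypothetical pool type
	'transcode-enqueue:' . $file->getName() . ':' . $key, // lock key per file + transcode
	[
		'doWork' => function () use ( $jobSpec ) {
			// Only a limited number of requests (e.g. 1) get here at a time,
			// so at most one of the simultaneous views queues the job.
			JobQueueGroup::singleton()->push( $jobSpec );
			return true;
		},
		'error' => function () {
			// Couldn't get a lock slot: someone else is already on it,
			// so this request just skips the work.
			return false;
		},
	]
);
$work->execute();

That only limits simultaneous requests, though; it wouldn't stop later views from re-queueing before the job has run, so we'd still want the state check too.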
-- brion
On Apr 20, 2016 10:45 PM, "Brion Vibber" bvibber@wikimedia.org wrote:
[snip]
Note that we could fire off a job queue background task to do the actual removal... But is it also safe to do that on a read-only request? https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacen... seems to indicate job queueing will be safe, but I'd like to confirm that. :)
I think this is the preferred method. My understanding is that the jobs will get shipped to the primary DC job queue.
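(For reference, nothing special should be needed on the extension side beyond the usual job type registration -- names hypothetical again:)

// In the extension's setup file.
$wgJobClasses['removeDisabledTranscodes'] = 'RemoveDisabledTranscodesJob';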
Similarly, in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to trigger missing transcodes to run on demand. The actual re-encoding happens in a background job, but we have to fire it off, and we have to record that we fired it off so we don't duplicate it...
(This would require a second queue to do the high-priority state table update and then queue the actual transcoding job; we can't put them in one queue, because a backlog of transcode jobs would prevent the high-priority job from running in a timely fashion.)
The job queue can do deduplication, although you would have to check if that is active while the job is running and not only while queued. Might help?
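For reference, the dedup mechanism is just a flag on the Job subclass (class and job names below are made up):

class EnqueueTranscodeJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'enqueueTranscode', $title, $params );
		// Drop new pushes that match an identical job already sitting in the
		// queue; note this only covers the queued, not-yet-running window.
		$this->removeDuplicates = true;
	}

	public function run() {
		// ... mark the transcode row, push the heavyweight job, etc. ...
		return true;
	}
}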
On Thu, Apr 21, 2016 at 4:59 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
On Apr 20, 2016 10:45 PM, "Brion Vibber" bvibber@wikimedia.org wrote:
Note that we could fire off a job queue background task to do the actual removal... But is it also safe to do that on a read-only request? https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacen... seems to indicate job queueing will be safe, but I'd like to confirm that. :)
I think this is the preferred method. My understanding is that the jobs will get shipped to the primary DC job queue.
*nod* looks like per spec that should work with few surprises.
Similarly, in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to trigger missing transcodes to run on demand. The actual re-encoding happens in a background job, but we have to fire it off, and we have to record that we fired it off so we don't duplicate it...
[snip]
The job queue can do deduplication, although you would have to check if that is active while the job is running and not only while queued. Might help?
Part of the trick is we want to let the user know that the job has been queued; and if the job errors out, we want the user to know that the job errored out.
Currently this means we have to update a row in the 'transcode' table (TimedMediaHandler-specific info about the transcoded derivative files) when we fire off the job, then update its state again when the job actually runs.
If that's split into two queues, one lightweight and one heavyweight, then this might make sense:
* N web requests hit something using File:Foobar.webm, which has a missing transcode
* they each try to queue up a job to the lightweight queue that says "start queueing this to actually transcode!"
* when the job queue runner on the lightweight queue sees the first such job, it records the status update to the database and queues up a heavyweight job to run the actual transcoding. The N-1 remaining jobs duped on the same title/params either get removed, or never got stored in the first place; I forget how it works. :)
* ... time passes, during which further web requests don't yet see the updated database table state, and keep queueing in the lightweight queue.
* lightweight queue runners see some of those jobs, but they have the updated master database state and know they don't need to act.
* database replication of the updated state hits the remote DC
* ... time passes, during which further web requests see the updated database table state and don't bother queueing the lightweight job
* eventually, the heavyweight job runs, completes, updates the states at start and end.
* eventually, the database replicates the transcode state completion to the remote DC.
* web requests start seeing the completed state, and their output includes the updated transcode information.
It all feels a bit complex, and I wonder if we could build some common classes to help with this transaction model. I'm pretty sure we can be making more use of background jobs outside of TimedMediaHandler's slow video format conversions. :D
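To make the lightweight job a bit more concrete, its run() might look very roughly like this -- I'm going from memory on the transcode table columns and the heavyweight job name, so treat the specifics as hypothetical:

public function run() {
	$dbw = wfGetDB( DB_MASTER );
	// Re-check state on the master: another lightweight job may already have
	// claimed this transcode while we sat in the queue.
	$alreadyQueued = $dbw->selectField(
		'transcode',
		'transcode_time_addjob',
		[
			'transcode_image_name' => $this->title->getDBkey(),
			'transcode_key' => $this->params['transcodeKey'],
		],
		__METHOD__
	);
	if ( $alreadyQueued ) {
		// Someone beat us to it; nothing to do.
		return true;
	}
	// Record that the transcode has been queued (web requests will see this
	// once it replicates), then push the heavyweight job that does the encoding.
	$dbw->update(
		'transcode',
		[ 'transcode_time_addjob' => $dbw->timestamp() ],
		[
			'transcode_image_name' => $this->title->getDBkey(),
			'transcode_key' => $this->params['transcodeKey'],
		],
		__METHOD__
	);
	JobQueueGroup::singleton()->push( new JobSpecification(
		'webVideoTranscode', // TMH's heavyweight transcode job, if I recall the name right
		[ 'transcodeKey' => $this->params['transcodeKey'] ],
		[],
		$this->title
	) );
	return true;
}

The nice property is that the only master writes happen on job runners in the primary DC; web requests only ever read replicated state and push jobs.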
-- brion
I've opened a phab task https://phabricator.wikimedia.org/T133448 about writing up good intro docs and updating other docs to match it.
Feel free y'all to add to that or hang additional tasks onto it like better utility classes to help folks transition code to background jobs... And maybe infrastructure to make sure we're handling those jobs reliably on small sites without dedicated job runners.
-- brion