There was a thought about the job queue that popped into my mind today.
From what I understand, for a wiki farm to use runJobs.php instead of the in-request queue (which is less desirable on high-traffic sites), the farm has to run runJobs.php periodically for each and every wiki it hosts. So, for example, if a wiki farm hosts 10,000 wikis and really wants to ensure that the queue is run at least hourly to keep the data on each wiki reasonably up to date, it essentially needs to call runJobs.php 10,000 times an hour (i.e. once per wiki), regardless of whether a wiki has any jobs or not. Either that, or poll each database beforehand, which is itself 10,000 database calls an hour on top of the runJobs executions, and still isn't that desirable.
What do people think of having another source class for the job queue, like we have for file storage, text storage, etc.?
The idea is that wiki farms would be able to implement a new job queue source which instead derives jobs from a single shared database table with the same structure as the normal job queue, plus a farm-specific wiki id column. Using this, a wiki farm could set up a cron job (or, to be even more effective at dispatching job runs, a daemon) which, instead of making 10,000 runJobs calls outright, fetches a random job row from the shared table, looks at the wiki id in that row, and executes runJobs (perhaps with a limit of 1000 jobs) for that wiki. It would then keep picking random jobs from the shared table and dispatching further runJobs executions, keeping the job queues running for every wiki on the farm without making wasteful runJobs calls for the pile of wikis that have no jobs to run.
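Concretely, the dispatcher I have in mind would be something like the sketch below. Nothing here exists in MediaWiki today; the shared farm_job table, its fj_wiki column and the exact runJobs.php invocation are all invented for illustration and would depend on how a given farm is set up.

<?php
// Hypothetical dispatcher sketch. The shared `farm_job` table and its
// `fj_wiki` column are made-up names; the table would otherwise mirror the
// structure of the normal per-wiki `job` table.
$pdo = new PDO( 'mysql:host=shared-db;dbname=farm', 'farmuser', 'secret' );

while ( true ) {
	// Pick one pseudo-random pending row and see which wiki it belongs to.
	// (ORDER BY RAND() is fine for a sketch; a real daemon would want
	// something cheaper on a big table.)
	$row = $pdo->query( 'SELECT fj_wiki FROM farm_job ORDER BY RAND() LIMIT 1' )
		->fetch( PDO::FETCH_ASSOC );

	if ( !$row ) {
		sleep( 5 ); // No wiki on the farm has pending jobs; back off briefly.
		continue;
	}

	// Dispatch a bounded runJobs.php run for just that wiki. How the farm
	// points the script at a particular wiki (--wiki, a conf variable, a
	// wrapper script) depends on its configuration.
	passthru( 'php maintenance/runJobs.php --wiki=' .
		escapeshellarg( $row['fj_wiki'] ) . ' --maxjobs=1000' );
}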
Any comments?
I'd recommend making use of existing message queuing systems such as ActiveMQ which already provide infrastructure for distributing messages to multiple clients, redelivering after a failed attempt, etc. We've had pretty good luck with this for StatusNet, where we run a lot of processing on individual messages through background queues to keep the frontend responsive.
One difficulty is that we don't have a good system for handling data for multiple sites in one process, so it may need an intermediate process to spawn out children for actual processing.
I think Tim did a little experimental work with Gearman; did that pan out?
-- brion
Yeah, that's why I was thinking of a job queue source class rather than trying to come up with some hack for integrating jobs into a shared db: job sources that make use of message queuing systems could then be plugged in simply by writing a source that talks to the broker instead of the db.
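To make that a bit more concrete, the kind of interface I'm picturing is something like the hypothetical sketch below. None of this exists in core; it's just the shape of the abstraction, with the default implementation wrapping the existing per-wiki job table and a farm (or an ActiveMQ-style setup) dropping in its own.

<?php
// Hypothetical interface only -- nothing like this exists in core yet.
// The default implementation would wrap the existing per-wiki `job` table;
// a farm could swap in one backed by a shared table or a message broker.
interface JobQueueSource {
	/** Insert a job into the queue. */
	public function push( Job $job );

	/** Fetch the next runnable job, or null if the queue is empty. */
	public function pop();

	/** Acknowledge that a popped job has finished, so brokers can stop
	 *  redelivering it after a failed attempt. */
	public function ack( Job $job );
}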
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Have you considered the fact that the WMF cluster is in this exact situation? ;)
However, we don't call runJobs.php for all wikis periodically. Instead, we call nextJobDB.php, which generates a list of wikis that have pending jobs (by connecting to all of their DBs), caches it in memcached (caching was broken until a few minutes ago, oops), and outputs a random DB name. We then run runJobs.php on that DB. This whole thing is in maintenance/jobs-loop.sh.
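Boiled down, the loop is doing something like the sketch below. This is only an illustration of the pattern, not the actual script; the real thing is a shell wrapper around the two maintenance scripts, and the exact invocations differ in our setup.

<?php
// Illustration of the jobs-loop pattern only, not the actual script.
while ( true ) {
	// nextJobDB.php builds (or pulls from memcached) the list of DBs that
	// have pending jobs and prints one of them at random.
	$db = trim( (string)shell_exec( 'php maintenance/nextJobDB.php' ) );

	if ( $db === '' ) {
		sleep( 5 ); // Nothing pending on any wiki right now.
		continue;
	}

	// Run a batch of jobs on the chosen wiki, then loop around again.
	passthru( 'php maintenance/runJobs.php --wiki=' . escapeshellarg( $db ) .
		' --maxjobs=300' );
}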
Roan Kattouw (Catrope)
Roan Kattouw wrote:
This whole thing is in maintenance/jobs-loop.sh
It's not in maintenance but in tools/jobs-loop: http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/jobs-loop/
The nextJobDB.php in that directory is completely outdated, though; the right one is in the maintenance folder.
Ok, then... How many databases are in the cluster being served by nextJobDB? How long does it take to connect to all the databases and figure out which ones have pending jobs?
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
I don't know how long it takes exactly. I do know that we're caching the list for 5 minutes (this caching was broken between Sep 09 and today, causing the script to regenerate the list each time).
The list is generated by connecting to each cluster and running one large query that covers all the databases in that cluster. We have about 815 databases spread over 6 clusters (three clusters of one, one of three, one of ~20 and one with the other ~790), so we only need to connect to 6 DB servers and run one query on each.
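Conceptually the per-cluster check has the shape below, built dynamically over the cluster's database list. This is just to illustrate the idea of one query per cluster rather than one connection per wiki; it is not the actual query nextJobDB.php runs.

<?php
// Illustrative only: one query covering every database on a cluster.
// $dbsOnCluster would come from that cluster's database list.
$dbsOnCluster = array( 'aawiki', 'abwiki', 'enwiki' ); // ...and hundreds more

$parts = array();
foreach ( $dbsOnCluster as $db ) {
	// Each subquery returns a row only if that wiki's job table is non-empty.
	$parts[] = "(SELECT '$db' AS dbname FROM `$db`.job LIMIT 1)";
}
$sql = implode( ' UNION ALL ', $parts );
// Running $sql once per cluster yields the list of DBs with pending jobs.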
Roan Kattouw (Catrope)