I've deleted all the slow refreshLinks2 jobs which have apparently been preventing the job queue from making any headway for the last few months. Some people report that they have received hundreds of edit notification emails in the last few hours, due to the months of backlog now being cleared.
-- Tim Starling
On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling tstarling@wikimedia.org wrote:
I've deleted all the slow refreshLinks2 jobs which have apparently been preventing the job queue from making any headway for the last few months. Some people report that they have received hundreds of edit notification emails in the last few hours, due to the months of backlog now being cleared.
So are there no alarm bells that go off when the job queue is unreasonably long, or do people just not listen to them? Perhaps we could have a bot in #wikimedia-tech that would complain every hour if the oldest job in the queue is more than X days old?
2009/2/16 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling tstarling@wikimedia.org wrote:
I've deleted all the slow refreshLinks2 jobs which have apparently been preventing the job queue from making any headway for the last few months. Some people report that they have received hundreds of edit notification emails in the last few hours, due to the months of backlog now being cleared.
So are there no alarm bells that go off when the job queue is unreasonably long, or do people just not listen to them? Perhaps we could have a bot in #wikimedia-tech that would complain every hour if the oldest job in the queue is more than X days old?
Alternatively, the number of jobs processed per request could be made a function of the length of the backlog (in terms of time) - the longer the backlog is, the faster we process jobs. Then if the job queue gets to be months behind, we would all notice it because everything would start running really slowly. (Obviously, the length of the job queue needs to be added to whatever diagnostic screen the devs first check when the site slows down, otherwise it won't help much.)
On Mon, Feb 16, 2009 at 11:02 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Alternatively, the number of jobs processed per request could be made a function of the length of the backlog (in terms of time) - the longer the backlog is, the faster we process jobs. Then if the job queue gets to be months behind, we would all notice it because everything would start running really slowly.
Jobs are not processed on requests. They're processed by a cron job. You can't just automatically run them at a crazy rate, because that will cause slave lag and other bad stuff. If too many are accumulating, it's probably due to a programming error that needs to be found and fixed by human inspection. (Tim just made several commits fixing things that were spewing out too many jobs.)
(Obviously, the length of the job queue needs to be added to whatever diagnostic screen the devs first check when the site slows down, otherwise it won't help much.)
#wikimedia-tech has enough people that regular warnings posted there would probably get noticed.
2009/2/16 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Mon, Feb 16, 2009 at 11:02 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Alternatively, the number of jobs processed per request could be made a function of the length of the backlog (in terms of time) - the longer the backlog is, the faster we process jobs. Then if the job queue gets to be months behind, we would all notice it because everything would start running really slowly.
Jobs are not processed on requests. They're processed by a cron job.
According to the documentation, they are run on requests by default; does Wikimedia not use that default?
On Mon, Feb 16, 2009 at 11:33 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
According to the documentation, they are run on requests by default; does Wikimedia not use that default?
That's correct, it doesn't. The default is really only for easier installation on shared hosting where cron might not be available (and perhaps Windows, although I imagine that has some cron equivalent).
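For concreteness, the difference between the two setups looks roughly like this in LocalSettings.php (the values and paths here are illustrative, not Wikimedia's actual configuration):

    // Shared-hosting default: run roughly this many queued jobs at the end
    // of each page request, so no shell or cron access is needed.
    $wgJobRunRate = 1;

    // Wikimedia-style setup: disable in-request job execution entirely...
    $wgJobRunRate = 0;
    // ...and drain the queue from a cron job instead, e.g. something like:
    //   */5 * * * *  php /path/to/mediawiki/maintenance/runJobs.php --maxjobs 1000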
Aryeh Gregor wrote:
If too many are accumulating, it's probably due to a programming error that needs to be found and fixed by human inspection. (Tim just made several commits fixing things that were spewing out too many jobs.)
(Obviously, the length of the job queue needs to be added to whatever diagnostic screen the devs first check when the site slows down, otherwise it won't help much.)
#wikimedia-tech has enough people that regular warnings posted there would probably get noticed.
People did complain about the long job queue on #wikimedia-tech. I don't think they were taken very seriously.
On Mon, Feb 16, 2009 at 6:20 PM, Platonides Platonides@gmail.com wrote:
People did complain about the long job queue on #wikimedia-tech. I don't think they were taken very seriously.
Yes, because they're not a bot that a) we know is reporting a real problem rather than a subjective impression, and that b) spams the complaint on an ongoing basis, the way Nagios does.
Part of the problem is that the measure of job queue length we really care about is "what was the last job executed?", not "how many jobs are in the queue?". If we added a job_timestamp column and put an index on it, we could replace (or supplement) the crude estimate we have now with one that is probably more useful and certainly more accurate.
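As a rough sketch of what that would enable (assuming a job_timestamp column existed and was indexed - it is not in the current schema), something like this could report the lag from a maintenance script, or feed it into Special:Statistics:

    // Sketch only: job_timestamp is the proposed column, not yet in the schema.
    $dbr = wfGetDB( DB_SLAVE );
    $oldest = $dbr->selectField( 'job', 'MIN(job_timestamp)', '', __METHOD__ );
    if ( $oldest ) {
        // Age of the oldest unexecuted job, in seconds.
        $lag = time() - wfTimestamp( TS_UNIX, $oldest );
        echo "Oldest queued job is $lag seconds old\n";
    } else {
        echo "Job queue is empty\n";
    }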
Aryeh Gregor wrote:
Part of the problem is that the measure of job queue length we really care about is "what was the last job executed?", not "how many jobs are in the queue?". If we added a job_timestamp column and put an index on it, we could replace (or supplement) the crude estimate we have now with one that is probably more useful and certainly more accurate.
Agreed. Adding the job queue lag to Special:Statistics would benefit both users and sysadmins.
Platonides wrote:
Aryeh Gregor wrote:
Part of the problem is that the measure of job queue length we really care about is "what was the last job executed?", not "how many jobs are in the queue?". If we added a job_timestamp column and put an index on it, we could replace (or supplement) the crude estimate we have now with one that is probably more useful and certainly more accurate.
Agreed. Adding the job queue lag to Special:Statistics would benefit both users and sysadmins.
I've been toying with some additions to the job queue so that we have some sense of what is going on: a timestamp, actual progress stats and, if we want to get extra fancy, a better view of what the job workers are doing.
Just simple things to help humans analyze what is going on. And if we're lucky... maybe tell us why.
Now that I'm back, I'm hoping to have something ready for Brion and everyone to look at soon.
--tomasz
On Feb 16, 2009, at 7:32 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
So are there no alarm bells that go off when the job queue is unreasonably long, or do people just not listen to them? Perhaps we could have a bot in #wikimedia-tech that would complain every hour if the oldest job in the queue is more than X days old?
The job queue does not have a timestamp field.
Andrew Garrett
2009/2/16 Andrew Garrett andrew@epstone.net:
On Feb 16, 2009, at 7:32 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
So are there no alarm bells that go off when the job queue is unreasonably long, or do people just not listen to them? Perhaps we could have a bot in #wikimedia-tech that would complain every hour if the oldest job in the queue is more than X days old?
The job queue does not have a timestamp field.
That would be a mistake, then.
Aryeh Gregor wrote:
On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling tstarling@wikimedia.org wrote:
I've deleted all the slow refreshLinks2 jobs which have apparently been preventing the job queue from making any headway for the last few months. Some people report that they have received hundreds of edit notification emails in the last few hours, due to the months of backlog now being cleared.
So are there no alarm bells that go off when the job queue is unreasonably long, or do people just not listen to them? Perhaps we could have a bot in #wikimedia-tech that would complain every hour if the oldest job in the queue is more than X days old?
If you check the server admin log, you'll find that this is the latest in a long series of attempts to fix this problem. I don't think it's completely fixed yet.
I'm not sure what good a complaining bot would do, any more than a complaining user, of which we seem to have plenty. Deleting the jobs was not a solution, and it can't really be repeated without breaking things. There's still a fair bit more programming to do.
-- Tim Starling
On Mon, Feb 16, 2009 at 8:20 PM, Tim Starling tstarling@wikimedia.org wrote:
I'm not sure what good a complaining bot would do, any more than a complaining user, of which we seem to have plenty. Deleting the jobs was not a solution, and it can't really be repeated without breaking things. There's still a fair bit more programming to do.
I misunderstood the problem, evidently. I thought it was a one-off thing due to software bugs the sysadmins didn't know about. It seems it's more like a known, ongoing problem whose cause isn't understood yet, so there's no point in bugging people about it all the time, no.
On Tue, Feb 17, 2009 at 9:27 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
I misunderstood the problem, evidently. I thought it was a one-off thing due to software bugs the sysadmins didn't know about. It seems it's more like a known, ongoing problem whose cause isn't understood yet, so there's no point in bugging people about it all the time, no.
Although, I still think an "oldest job" statistic might be useful.
Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Tue, Feb 17, 2009 at 9:27 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
I misunderstood the problem, evidently. I thought it was a one-off thing due to software bugs the sysadmins didn't know about. It seems it's more like a known, ongoing problem whose cause isn't understood yet, so there's no point in bugging people about it all the time, no.
Although, I still think an "oldest job" statistic might be useful.
Not only for statistics: it would also give users a way to see whether it is the job queue or a bug when a category membership or a link/template relationship is not updated soon after the actual edit to the article.
In the same way, it would be nice to extend "?action=purge" to log any discrepancies it encounters between the pre- and post-purge states, to ease debugging.
Tim