This week for our IRC bug triage, I decided to focus on problems reported with caching. We focused on six bugs.
You can read the logs of the discussion: http://hexm.de/54 The etherpad: http://etherpad.wikimedia.org/BugTriage-2011-07
http://bugzilla.wikimedia.org/20468 — User::invalidateCache throws 1205: Lock wait timeout exceeded
These lock timeouts happen frequently enough that we can start to track them down. As Tim said, to solve this: “We should reduce the transaction time and number of locks in a transaction.”
Since these are showing up often enough, we'll start logging the backtrace to figure out where User::invalidateCache is being called from, and add commit() where necessary.
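To make the backtrace idea concrete, here is a minimal sketch in Python (not MediaWiki's PHP, and not the actual patch). The names LockWaitTimeout and run_logged are made up for illustration. The exception class simply stands in for MySQL error 1205.

```python
import traceback


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for MySQL error 1205 (lock wait timeout)."""


def run_logged(operation, log):
    """Run operation(); on a lock timeout, record who called us, then re-raise.

    `log` is any list-like sink; a real system would write to a log file.
    """
    try:
        return operation()
    except LockWaitTimeout:
        # format_stack() captures the current call stack, i.e. the chain of
        # callers that led to this point, which is what we want to inspect.
        log.append("".join(traceback.format_stack()))
        raise
```

Once the logged stacks show which callers hold a transaction open for too long, those are the places where an earlier commit() would shorten the time locks are held.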
http://bugzilla.wikimedia.org/26338 — Wikimedia Javascript and CSS files are getting an extra max-age cache-control param
This bug was filed back before ResourceLoader was deployed. After Ryan confirmed that this is less of a problem now, he pointed to a couple of places where files are still served without ResourceLoader and that would benefit from adding Apache directives.
http://bugzilla.wikimedia.org/26360 — Disabling sessions in memcached produces open() error
Before we got to this one in triage, Chad was already busy investigating it. He thinks this was broken way back in r49370. Under “You broke it, you buy it”, he is fixing the problem.
http://bugzilla.wikimedia.org/29223 — Querying for rvdiffto=prev fails for many revids: "notcached"
Sam has reportedly been working on this one and may have already fixed it in trunk. I’ll check with him.
All was not lost in the discussion of this bug, though. It reminded Tim that there is a similar problem with action=parse. It only fetches from the parser cache, it doesn't store to it. This problem reduces our parser cache hit ratio significantly since we have a growing number of action=parse hits due to Android and iPhone apps.
I filed a new bug to fix the problem Tim mentioned: http://bugzilla.wikimedia.org/29907
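To illustrate why a fetch-only code path hurts the hit ratio, here is a toy model in Python. It is only a sketch under simplifying assumptions: ParserCacheSketch, get_html and hit_ratio are invented names, and the real parser cache keys on much more than a page title.

```python
class ParserCacheSketch:
    """Toy cache that counts hits and misses (not MediaWiki's ParserCache)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_html(self, page, parse, save_on_miss):
        # Serve from the cache when possible.
        if page in self.store:
            self.hits += 1
            return self.store[page]
        # Otherwise parse; only some callers save the result for later.
        self.misses += 1
        html = parse(page)
        if save_on_miss:
            self.store[page] = html
        return html


def hit_ratio(requests, save_on_miss):
    """Replay a list of page requests and return the fraction served from cache."""
    cache = ParserCacheSketch()
    for page in requests:
        cache.get_html(page, lambda p: "<p>" + p + "</p>", save_on_miss)
    return cache.hits / len(requests)


requests = ["A", "B", "A", "A", "B"]
print(hit_ratio(requests, save_on_miss=False))  # fetch-only: nothing is ever cached
print(hit_ratio(requests, save_on_miss=True))   # fetch and store
```

In this toy run the fetch-only path never gets a hit, while the fetch-and-store path serves three of the five requests from cache.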
http://bugzilla.wikimedia.org/29384 — Load order of request in IE6 messes with dependancy resolving (mediawiki.util not available in time)
Krinkle has been looking into this one but doesn't yet know what is causing it. Perhaps he and Trevor will have time to look at it this coming week when he is in San Francisco.
http://bugzilla.wikimedia.org/29552 — Squid cache of redirect pages don't get purged when page it redirects to gets edited
Much of the discussion for this bug and the next one overlapped, but Tim suggested that we should be seeing the same problems with templatelinks as we are seeing with redirect pages.
Roan responded that he thought there frequently were problems with templatelinks but that they were mis-attributed to the job queue instead of squid problems.
http://bugzilla.wikimedia.org/28613 — Thumbnails of updated files fail to purge on squids
There is a lot of speculation as to *what* is causing these problems. Initially, we thought the squid caching problem was a symptom of a hardware issue that the new routers being installed would fix.
With the new routers in place, though, it became clear that this wasn't simply a matter of faulty hardware. After some discussion, we thought packet loss (perhaps because MediaWiki does not throttle the UDP packets it sends) might be a cause. I filed a ticket in RT (http://rt.wikimedia.org/Ticket/Display.html?id=1174) to get Ops to add listeners to the multicast group so that we could see if there was any packet loss and, if so, where it was coming from.
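For readers wondering what such a listener might look like, here is a rough Python sketch. It is not what Ops will deploy. The function names are invented, the group address and port are left as parameters because I don't want to guess the real ones, and it assumes we can compare the purge URLs that were sent against what each listener saw.

```python
import socket
import struct


def listen(group, port, max_packets):
    """Join a UDP multicast group and return up to max_packets raw payloads.

    `group` and `port` are placeholders supplied by the caller.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Ask the kernel to join the group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    payloads = []
    try:
        while len(payloads) < max_packets:
            data, _addr = sock.recvfrom(65535)
            payloads.append(data)
    finally:
        sock.close()
    return payloads


def loss_report(sent, received_by_listener):
    """For each listener, list the sent items it never received."""
    report = {}
    for name, received in received_by_listener.items():
        report[name] = sorted(set(sent) - set(received))
    return report
```

If one listener's list of missing items is consistently longer than the others', that points at where on the network the packets are being dropped.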
If it turns out that there is no packet loss (or other network problems), then we'll have to look at MediaWiki itself.
Thanks to everyone's participation, I felt like this week's triage was especially productive.
Till next week,
Mark.