Notes for the IRC bug triage - Wikitech-l

16 Jul 2011

This week for our IRC bug triage, I decided to focus on problems
reported with caching.  We focused on six bugs.

You can read the logs of the discussion: http://hexm.de/54
The etherpad: http://etherpad.wikimedia.org/BugTriage-2011-07

http://bugzilla.wikimedia.org/20468 — User::invalidateCache throws
    1205: Lock wait timeout exceeded

    These lock timeouts happen frequently enough that we can start to
    track them down.  As Tim said, to solve this: “We should reduce the
    transaction time and number of locks in a transaction.”

    Since these are showing up enough, we'll start to log the
    backtrace, figure out where it is being called and add commit()
    where necessary.

http://bugzilla.wikimedia.org/26338 — Wikimedia Javascript and CSS
    files are getting an extra max-age cache-control param

    This bug was filed before ResourceLoader was deployed.  After
    Ryan confirmed that it is less of a problem now, he pointed to a
    couple of places where files are still served without
    ResourceLoader and would benefit from added Apache directives.

http://bugzilla.wikimedia.org/26360 — Disabling sessions in memcached
    produces open() error

    Before we got to this one in triage, Chad was already busy
    investigating it.  He thinks this was broken way back in r49370.
    Under “You broke it you buy it”, he is fixing the problem.

http://bugzilla.wikimedia.org/29223 — Querying for rvdiffto=prev fails
    for many revids: "notcached"

    Sam has reportedly been working on this one and may have already
    fixed it in trunk.  I’ll check with him.

    All was not lost in the discussion of this bug, though.  It
    reminded Tim that there is a similar problem with action=parse.
    It only fetches from the parser cache; it doesn't store to it.
    This problem reduces our parser cache hit ratio significantly
    since we have a growing number of action=parse hits due to Android
    and iPhone apps.
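
    To make the cost concrete, here is a minimal sketch of the two
    behaviours.  It is not MediaWiki code: ParserCache, parse() and
    both render functions are hypothetical names made up for
    illustration.  It only shows why a path that reads the cache but
    never writes to it keeps missing on repeated requests.

```python
# Illustrative only: every name here is hypothetical, not MediaWiki's API.
class ParserCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        # Count each lookup so the hit ratio can be compared afterwards.
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def save(self, key, html):
        self._store[key] = html


def parse(wikitext):
    # Stand-in for the expensive parse step.
    return "<p>" + wikitext + "</p>"


def render_fetch_only(cache, key, wikitext):
    # The behaviour described above: read the cache, never write to it.
    cached = cache.get(key)
    return cached if cached is not None else parse(wikitext)


def render_fetch_and_store(cache, key, wikitext):
    # Read the cache and, on a miss, store the freshly parsed output.
    cached = cache.get(key)
    if cached is not None:
        return cached
    html = parse(wikitext)
    cache.save(key, html)
    return html
```

    With three identical requests, the fetch-only path misses all
    three times, while the fetch-and-store path misses once and then
    hits twice.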

    I filed a new bug to fix the problem Tim mentioned:
    http://bugzilla.wikimedia.org/29907

http://bugzilla.wikimedia.org/29384 — Load order of request in IE6
    messes with dependancy resolving (mediawiki.util not available in
    time)

    Krinkle has been looking into this one but doesn't yet know what
    is causing it.  Perhaps he and Trevor will have time to look at it
    in this coming week when he is in San Francisco.

http://bugzilla.wikimedia.org/29552 — Squid cache of redirect pages
    don't get purged when page it redirects to gets edited

    Much of the discussion for this bug and the next one overlapped,
    but Tim suggested that we should be seeing the same problems with
    templatelinks as we are seeing with redirect pages.

    Roan responded that he thought there frequently were problems with
    templatelinks but that they were mis-attributed to the job queue
    instead of squid problems.

http://bugzilla.wikimedia.org/28613 — Thumbnails of updated files fail
    to purge on squids

    There is lots of speculation as to *what* is causing these
    problems.  Initially, we thought the squid caching problem was a
    symptom of a hardware issue that the new routers being installed
    this week would fix.

    With the new routers in place, though, it became clear that this
    wasn't simply a matter of faulty hardware.  After some discussion,
    we thought packet loss (perhaps because MediaWiki does not
    throttle the UDP packets it sends) might be a cause.  I filed a
    ticket in RT (http://rt.wikimedia.org/Ticket/Display.html?id=1174)
    to get Ops to add listeners to the multicast group so that we
    could see if there was any packet loss and, if so, where it was
    coming from.
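
    As a rough sketch of what such a listener could look like, and
    not what Ops will actually deploy: the code below joins a
    multicast group and compares what each listener saw against what
    was sent.  The group address, port, listener names and the idea of
    comparing lists of purged URLs are all assumptions made for
    illustration; decoding the purge packets into URLs is left out.

```python
import socket


def open_multicast_listener(group, port):
    # Join an IPv4 multicast group on the default interface.
    # `group` and `port` are placeholders supplied by the caller.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: group address followed by the local interface (0.0.0.0 = any).
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def missing_by_listener(sent, received_by_listener):
    # For each listener, list the sent purges it never received.
    # An empty list means no loss was observed at that listener.
    sent_set = set(sent)
    return {
        name: sorted(sent_set - set(received))
        for name, received in received_by_listener.items()
    }
```

    Comparing per listener, rather than in aggregate, is what would
    show where any loss is coming from.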

    If it turns out that there is no packet loss (or other network
    problems), then we'll have to look at MediaWiki itself.

Thanks to everyone's participation, I felt like this week's triage was
especially productive.

Till next week,

Mark.