(added engineering@lists.wikimedia.org to recipients list)
On Fri, Mar 7, 2014 at 5:27 PM, Bryan Davis bd808@wikimedia.org wrote:
On Thursday every week a new WMF branch is cut and deployed to the group0 wikis (test* and wm.o). On the following Tuesday it is promoted to the group1 wikis (all-wikipedias). Finally, on Thursday it is promoted to group2 (wikipedias) while the group0 wikis start using another new version. At the current release cadence (one new branch a week), a branch is no longer used after 2 weeks in production. There can be minor exceptions to this due to major difficulties with a branch and/or holiday conflicts, but for the sake of this discussion those differences can be mostly ignored.
A branch can't be deleted from the server cluster immediately after it is removed from the last wiki, however. For better or worse, each branch contains static assets from core (resources & skins) and extensions that are served by the apaches. These assets are served using versioned URLs such as https://bits.wikimedia.org/static-1.23wmf17/skins/common/images/poweredby_me.... Varnish caches pages containing these URLs for anons for up to 30 days. That means that static content from the 1.23wmf17 branch could be needed to satisfy an apache request for up to 30 days after that branch is no longer being used to satisfy PHP-backed requests. Assuming the weekly release cadence, this means that the static assets from a branch are needed on the cluster for at least 45 days (14 days of active branch use + 31 days of cached page use).
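As a rough sketch of the retention arithmetic above (the function names and parameters here are made up for illustration and are not part of any deploy tooling), the earliest date a branch's static assets could safely be dropped is the first deploy date plus the active-use window plus the cache lifetime:

```python
from datetime import date, timedelta

# Figures taken from the paragraph above; both are parameters so they
# can be adjusted if the release cadence or cache lifetime changes.
ACTIVE_DAYS = 14      # a branch is in production use for about 2 weeks
CACHE_MAX_DAYS = 31   # cached pages may reference its assets this long


def retention_days(active_days: int = ACTIVE_DAYS,
                   cache_max_days: int = CACHE_MAX_DAYS) -> int:
    """Total days a branch's static assets must stay on the cluster."""
    return active_days + cache_max_days


def earliest_safe_removal(first_deploy: date,
                          active_days: int = ACTIVE_DAYS,
                          cache_max_days: int = CACHE_MAX_DAYS) -> date:
    """Earliest date on which the branch checkout could be deleted."""
    return first_deploy + timedelta(days=retention_days(active_days, cache_max_days))


if __name__ == "__main__":
    # Hypothetical deploy date, used only to show the calculation.
    print(retention_days())
    print(earliest_safe_removal(date(2014, 1, 2)))
```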
At the moment we don't have a well-documented procedure for cleaning up old branches on tin and servers that rsync with tin (directly and indirectly). It seems to be a process that Sam does occasionally. The last commits that cleaned up old branches were merged on 2014-02-15: https://gerrit.wikimedia.org/r/#/c/113640/, https://gerrit.wikimedia.org/r/#/.... These commits cleaned up some truly ancient branches.
A slightly different but related problem is the amount of disk space consumed by the l10n cache files for unused MW versions. The combined json and CDB files for the current 1.23 branches consume ~1.7G per version. It looks like Sam has been pruning these at some point as well, as the cache/l10n directories for version 1.23wmf12 and earlier are empty.
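For anyone who wants to check the per-version numbers themselves, here is a minimal sketch that totals the size of the files under a directory. The helper name is hypothetical, and the path you pass in (for example a branch's cache/l10n directory) depends on where the branches are checked out, which I am not assuming here:

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def format_gib(num_bytes: int) -> str:
    """Render a byte count in GiB with one decimal place."""
    return f"{num_bytes / 1024 ** 3:.1f}G"
```

Calling `format_gib(dir_size_bytes(some_l10n_dir))` for each checked out version would give figures comparable to the ~1.7G quoted above.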
I recommend that we add two new weekly cleanup steps:
- When we deploy a new branch to group0 (Thursdays), all branches retired more than 5 weeks ago should be removed. This should really only include multiple branches the first time it's done to catch up. After that it will be an "add a branch, kill a branch" situation. With the current release cadence this will keep us at 7 checked out branches on tin, 2 versions in active use and 5 waiting for potential cache references to expire.
- When we move group1 to the newest branch (Tuesdays), the cache/l10n directory of all non-active branches should be purged. By this point there is little chance that we will be reverting the wikipedias to the N-2 branch and thus the l10n cache is just taking up disk space and slowing down rsync comparisons.
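The two steps above could be sketched roughly as follows. This only computes which branches each step would act on; it deletes nothing. The function names, the shape of the input data, and the example dates are all assumptions made for illustration, not existing deploy tooling:

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List


def branches_to_remove(retired_on: Dict[str, date], today: date,
                       grace_weeks: int = 5) -> List[str]:
    """Thursday step: branches retired more than `grace_weeks` weeks ago.

    `retired_on` maps a branch name to the date it stopped serving any wiki.
    """
    cutoff = today - timedelta(weeks=grace_weeks)
    return sorted(name for name, retired in retired_on.items() if retired < cutoff)


def l10n_caches_to_purge(checked_out: Iterable[str],
                         active: Iterable[str]) -> List[str]:
    """Tuesday step: checked out branches that no wiki is currently using."""
    return sorted(set(checked_out) - set(active))


if __name__ == "__main__":
    # Hypothetical retirement dates, for illustration only.
    retired = {
        "1.23wmf10": date(2014, 1, 30),
        "1.23wmf11": date(2014, 2, 6),
        "1.23wmf12": date(2014, 2, 13),
    }
    print(branches_to_remove(retired, today=date(2014, 3, 13)))
    print(l10n_caches_to_purge(
        ["1.23wmf15", "1.23wmf16", "1.23wmf17"],
        active=["1.23wmf16", "1.23wmf17"],
    ))
```

Note that a branch retired exactly 5 weeks before `today` is kept, since the rule is "more than 5 weeks ago".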
Are there any objections to adding these procedures to the MW deploy process?
Minor content correction: mentions of "30 days" should really have been "31 days". Apparently I changed it in some places before I hit send but I didn't get them all. The 31 day upper limit comes from the $wgSquidMaxage setting in InitialiseSettings.php [0].
[0]: https://git.wikimedia.org/blob/operations%2Fmediawiki-config/87e36518db5644f...
Bryan