On Thursday every week a new WMF branch is cut and deployed to the group0 wikis (test* and wm.o). On the following Tuesday it is promoted to the group1 wikis (all wikis except the wikipedias). Finally, on Thursday it is promoted to group2 (wikipedias) while the group0 wikis start using another new version. At the current release cadence (one new branch a week), a branch is no longer used after 2 weeks in production. There can be minor exceptions to this due to major difficulties with a branch and/or holiday conflicts, but for the sake of this discussion those differences can be mostly ignored.
A branch can't be deleted from the server cluster immediately after it is removed from the last wiki, however. For better or worse, each branch contains static assets from core (resources & skins) and extensions that are served by the apaches. These assets are served using versioned URLs such as https://bits.wikimedia.org/static-1.23wmf17/skins/common/images/poweredby_me.... Varnish caches pages containing these URLs for anons for up to 30 days. That means that static content from the 1.23wmf17 branch could be needed to satisfy an apache request for up to 30 days after that branch is no longer being used to satisfy PHP-backed requests. Assuming the weekly release cadence, this means that the static assets from a branch are needed on the cluster for at least 45 days (14 days of active branch use + 31 days of cached page use).
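The retention arithmetic above can be sketched as follows. This is only an illustration and not part of any deploy tooling: the function name and the example date are hypothetical, and the 14-day and 31-day figures are simply the ones quoted in the paragraph above.

```python
from datetime import date, timedelta

# Assumed values taken from the discussion above, not read from any config.
ACTIVE_DAYS = 14      # a branch serves PHP-backed requests for about two weeks
CACHE_TTL_DAYS = 31   # upper bound on how long a cached page may survive


def earliest_safe_removal(last_live: date, cache_ttl_days: int = CACHE_TTL_DAYS) -> date:
    """Return the date by which pages cached while the branch was live should have expired."""
    return last_live + timedelta(days=cache_ttl_days)


# Total time the static assets of one branch need to stay on the cluster.
total_days_needed = ACTIVE_DAYS + CACHE_TTL_DAYS

# Hypothetical example: a branch removed from its last wiki on 2014-03-06.
example = earliest_safe_removal(date(2014, 3, 6))
print(total_days_needed, example.isoformat())
```

The default TTL is a parameter so that a different cache lifetime can be tried without touching the function.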
At the moment we don't have a well-documented procedure for cleaning up old branches on tin and servers that rsync with tin (directly and indirectly). It seems to be a process that Sam does occasionally. The last commits that cleaned up old branches were merged on 2014-02-15: https://gerrit.wikimedia.org/r/#/c/113640/, https://gerrit.wikimedia.org/r/#/.... These commits cleaned up some truly ancient branches.
A slightly different but related problem is the amount of disk space consumed by the l10n cache files for unused MW versions. The combined json and CDB files for the current 1.23 branches consume ~1.7G per version. It looks like Sam has been pruning these at some point as well, since the cache/l10n directories for version 1.23wmf12 and earlier are empty.
I recommend that we add two new weekly cleanup steps:
* When we deploy a new branch to group0 (Thursdays), all branches retired more than 5 weeks ago should be removed. This should only include multiple branches the first time it's done, to catch up. After that it will be an "add a branch, kill a branch" situation. With the current release cadence this will keep us at 7 checked-out branches on tin: 2 versions in active use and 5 waiting for potential cache references to expire.
* When we move group1 to the newest branch (Tuesdays), the cache/l10n directory of all non-active branches should be purged. By this point there is little chance that we will be reverting the wikipedias to the N-2 branch and thus the l10n cache is just taking up disk space and slowing down rsync comparisons.
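The first rule can be sketched as a small selection function. This is a minimal sketch under stated assumptions: the branch labels ("N", "N-1", ...) and the retirement dates are made up for illustration, and nothing here reflects how tin actually records when a branch was retired.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

RETENTION = timedelta(weeks=5)  # the "retired more than 5 weeks ago" rule


def branches_to_remove(retired_on: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return branches whose retirement date is more than RETENTION before today.

    A value of None marks a branch that is still in active use.
    """
    return [
        name
        for name, retired in retired_on.items()
        if retired is not None and today - retired > RETENTION
    ]


# Hypothetical state on a Thursday: two active branches and six retired ones,
# retired 1 to 6 weeks ago respectively.
today = date(2014, 3, 13)
state: Dict[str, Optional[date]] = {"N": None, "N-1": None}
for weeks_ago in range(1, 7):
    state[f"N-{weeks_ago + 1}"] = today - timedelta(weeks=weeks_ago)

doomed = branches_to_remove(state, today)
remaining = len(state) - len(doomed)
print(doomed, remaining)
```

With a steady weekly cadence only the oldest branch qualifies each week, which leaves the 2 active plus 5 waiting branches described above.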
Are there any objections to adding these procedures to the MW deploy process?
Bryan
(added engineering@lists.wikimedia.org to recipients list)
Minor content correction: mentions of "30 days" should really have been "31 days". Apparently I changed it in some places before I hit send but I didn't get them all. The 31 day upper limit comes from the $wgSquidMaxage setting in InitialiseSettings.php [0]
[0]: https://git.wikimedia.org/blob/operations%2Fmediawiki-config/87e36518db5644f...
Bryan
I recall during some fundraising adventures with CentralNotice that in some cases things were persisting in cache beyond the expiry of $wgSquidMaxage. We were debating setting $wgCacheEpoch [0] before I just went through and issued manual purges on all the affected pages (also causing an outage of swift because it couldn't handle a lot of deletes...).
[0] https://www.mediawiki.org/wiki/Manual:$wgCacheEpoch
~Matt Walker Wikimedia Foundation Fundraising Technology Team
On Mon, Mar 10, 2014 at 10:29 AM, Bryan Davis bd808@wikimedia.org wrote:
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA irc: bd808 v:415.839.6885 x6855
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
On Mon, Mar 10, 2014 at 2:10 PM, Matthew Walker mwalker@wikimedia.org wrote:
I recall during some fundraising adventures with CentralNotice that in some cases things were persisting in cache beyond the expiry of $wgSquidMaxage. We were debating setting $wgCacheEpoch [0] before I just went through and issued manual purges on all the affected pages (also causing an outage of swift because it couldn't handle a lot of deletes...).
Was this quite a while ago? Greg pointed out bug 44570 [1] when I started asking questions about cleaning up old branches. It looks to me like the interesting behavior of cache TTL reset when the backing article hasn't been edited in the 31 day window should be fixed in production since 2013-04-24. This is exactly the sort of gotcha I was hoping would be surfaced by asking around though so please correct me if there is still a way that the static assets can be needed for more than the "use plus 31 days" window.
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=44570
Bryan
My scenario happened about 8 months ago, I think. Certainly before https://gerrit.wikimedia.org/r/#/c/59414/ got pushed to the cluster, which from the bug / description I think would've solved it.
~Matt Walker Wikimedia Foundation Fundraising Technology Team
On Mon, Mar 10, 2014 at 1:22 PM, Bryan Davis bd808@wikimedia.org wrote:
On Fri, Mar 7, 2014 at 5:27 PM, Bryan Davis bd808@wikimedia.org wrote:
I recommend that we add two new weekly cleanup steps:
- When we deploy a new branch to group0 (Thursdays), all branches retired more than 5 weeks ago should be removed. This should really only include multiple branches the first time it's done to catch up. After that it will be an "add a branch, kill a branch" situation. With the current release cadence this will keep us at 7 checked out branches on tin, 2 versions in active use and 5 waiting for potential cache references to expire.
I did this process today for the first time. I ended up leaving 8 branches on tin instead of 7. I removed these branches:
* 1.23wmf6
* 1.23wmf7
* 1.23wmf8
* 1.23wmf9
* 1.23wmf10
1.23wmf11 was last live on 2014-02-06. It should be safe to delete today, but I didn't want to tempt fate or anger any deity that may control front-end 404 generation.
Here's what I did to delete each branch:
* Cleanup bits symlinks to branch
** /a/common/multiversion/deleteMediaWiki php-1.23wmfX
* Create symlinks cleanup patch
** git rm -r docroot/bits/static-1.23wmfX
** git rm -r w/static-1.23wmfX
** NOLOGMSG=1 git commit -m 'Remove 1.23wmfX symlinks'
* Revert change on tin
** NOLOGMSG=1 git reset HEAD^ --hard
* Repeat steps above for each branch as needed
* Approve and pull symlink cleanup patch to tin
* Delete branch checkout in /a/common
** rm -rf /a/common/php-1.23wmfX
* Repeat steps above for each branch as needed
* scap
When scap runs it will leave the php-1.23wmfX/cache/l10n/*.cdb files in place because we ignore *.cdb purposefully in the rsync command run by sync-common. This can be cleaned up with dsh:
    dsh -g mediawiki-installation -M -F 40 -- \
      'sudo -u mwdeploy -- rm -r /usr/local/apache/common-local/php-1.23wmfX'
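Because several of these commands are destructive, it may help to generate them per version and review them before running anything. The sketch below only builds the command strings listed above and never executes them; the function name, the grouping keys and the version-format check are my own assumptions, not part of scap or multiversion.

```python
import re
from typing import Dict, List

# Assumed branch naming pattern (e.g. "1.23wmf10"); anything else is rejected
# so that a typo cannot expand into an unintended path for rm -rf.
VERSION_RE = re.compile(r"\d+\.\d+wmf\d+")


def cleanup_plan(version: str) -> Dict[str, List[str]]:
    """Return the shell commands from the steps above for one branch, grouped by phase."""
    if not VERSION_RE.fullmatch(version):
        raise ValueError(f"unexpected version string: {version!r}")
    return {
        "symlink_patch": [
            f"/a/common/multiversion/deleteMediaWiki php-{version}",
            f"git rm -r docroot/bits/static-{version}",
            f"git rm -r w/static-{version}",
            f"NOLOGMSG=1 git commit -m 'Remove {version} symlinks'",
            "NOLOGMSG=1 git reset HEAD^ --hard",
        ],
        "checkout_removal": [f"rm -rf /a/common/php-{version}"],
        "cluster_cleanup": [
            "dsh -g mediawiki-installation -M -F 40 -- "
            f"'sudo -u mwdeploy -- rm -r /usr/local/apache/common-local/php-{version}'"
        ],
    }


# Print the plan for one of the branches removed above, for manual review.
for phase, commands in cleanup_plan("1.23wmf10").items():
    print(f"# {phase}")
    for command in commands:
        print(command)
```

The manual steps in between (approving and pulling the symlink patch, running scap) are deliberately left out, since they are not single commands.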
Chad has suggested that an additional step of tagging the end of the branch and deleting it be added to this process. Right now there is a minor blocker on that as the branch delete can only be done via the gerrit web ui and requires user permissions that I do not currently have. I will make the tags though and let Chad either delete the branches or figure out how to let me and Sam do it.
- When we move group1 to the newest branch (Tuesdays), the cache/l10n directory of all non-active branches should be purged. By this point there is little chance that we will be reverting the wikipedias to the N-2 branch and thus the l10n cache is just taking up disk space and slowing down rsync comparisons.
I did this procedure for the first time this week as well, purging the l10n cache for 1.23wmf13, 1.23wmf14 and 1.23wmf15. The process I followed was:
* Delete l10n files in /a/common
** sudo -u l10nupdate rm --recursive ${COMMON}/php-${VERSION}/cache/l10n/*
* Delete l10n files on cluster hosts
** dsh "${MW_DSH_ARGS[@]}" -- \
   "sudo -u mwdeploy rm --recursive ${SHARED}/php-${VERSION}/cache/l10n/*"
* !log deletion
This has been turned into a script in the scap project that is awaiting code review [0]. Once it is merged the process will be reduced to:
scap-purge-l10n-cache --version 1.23wmfX
[0]: https://gerrit.wikimedia.org/r/#/c/118337/
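For readers who want to see the local half of the purge spelled out, here is a rough sketch of emptying a cache/l10n directory while keeping the directory itself. It is not the scap script under review: the function name is hypothetical, the file names in the demo are placeholders, and it ignores the sudo and dsh parts entirely.

```python
import shutil
import tempfile
from pathlib import Path


def purge_l10n_cache(common_dir: Path, version: str) -> int:
    """Delete every entry inside php-<version>/cache/l10n and return how many were removed."""
    cache_dir = common_dir / f"php-{version}" / "cache" / "l10n"
    removed = 0
    if not cache_dir.is_dir():
        return removed  # nothing to purge for this version
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


# Demo against a throwaway directory tree rather than a real deploy host.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    l10n = root / "php-1.23wmfX" / "cache" / "l10n"
    l10n.mkdir(parents=True)
    (l10n / "example.cdb").write_text("placeholder")
    (l10n / "example.json").write_text("{}")
    removed_count = purge_l10n_cache(root, "1.23wmfX")
    directory_kept = l10n.is_dir()
    leftovers = list(l10n.iterdir())

print(removed_count, directory_kept, leftovers)
```

Leaving the empty directory in place mirrors the `rm --recursive .../cache/l10n/*` form used in the steps above, which removes the contents but not the directory.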
mediawiki-core@lists.wikimedia.org