TL;DR: I've decided to sunset my "snapshots" tool.
The Snapshots tool created TAR-archives of MediaWiki core branches, fresh
from Gerrit, every hour.
I created it in 2012 on the Toolserver,[1] to make it easier for site
admins to try out the alpha version of MediaWiki (or a WMF branch), using
the same format as our official stable releases. The snapshots were
generated using a PHP script and git-cli commands, scheduled as a cron job
on the Toolforge Grid.[2][3]
### *Lessons* learned
Maintaining this tool has mostly been an exercise in learning how hard it
is to keep a Git repository functional over a long period of time.
I learned about the numerous ways that a Git repository can become
corrupted or unusable when commands are terminated in unforeseen ways. For
example, what happens to the state of a Git repository when it's a clone of
MediaWiki core, stored on NFS, and you switch from current master to a
release branch from a decade ago while lots of users are also working on
that same NFS mount, and you do this for every branch, every hour? [4]
I learned how poor Git can be at keeping track of which files are part of a
branch and which aren't. Even when there aren't any errors, if you switch
to an old branch with directories or submodules that a newer branch doesn't
have and then switch back, the old files can stay behind and show up as
untracked files. These then cause later checkouts to fail with conflicts
when another branch does have the file in question.
(This was improved a lot around 2015 with later releases of Git 2.x.)
I learned that, apparently, when Git's garbage collector kicks in from time
to time, it doesn't know how much memory it is allowed to use, and will
"efficiently" use larger and larger blocks to speed up the process until it
gets killed by the grid engine, at which point it will eventually start
again and make the same mistake, until someone comes in and manually runs
git-gc outside the grid. (This was initially worked around by disabling
git-gc. I later found a way to re-enable it with more constrained settings,
see [5].)
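As a rough illustration of what "more constrained settings" can look like, here is a minimal Python sketch that caps the memory-related options Git uses while packing. The config keys are standard Git options, but the values and the helper names (`GC_LIMITS`, `gc_config_commands`, `apply_gc_limits`) are assumptions made for this example; the settings the tool actually used are the ones described in [5].

```python
import subprocess

# Illustrative limits only (assumed values, not the ones from [5]).
GC_LIMITS = {
    "pack.threads": "1",                # a single thread, so limits are not multiplied
    "pack.windowMemory": "64m",         # cap memory for the delta window, per thread
    "pack.deltaCacheSize": "64m",       # cap the cache of computed deltas
    "core.packedGitWindowSize": "16m",  # size of each mapped chunk of a pack file
    "core.packedGitLimit": "128m",      # total pack data mapped at the same time
}


def gc_config_commands(repo_dir, limits=GC_LIMITS):
    """Build one 'git config' command per setting, without running anything."""
    return [["git", "-C", repo_dir, "config", key, value]
            for key, value in limits.items()]


def apply_gc_limits(repo_dir, limits=GC_LIMITS):
    """Write the limits into the repository's local config."""
    for cmd in gc_config_commands(repo_dir, limits):
        subprocess.run(cmd, check=True)
```

Calling `apply_gc_limits("/path/to/clone")` once would be enough, since the values are stored in the clone's `.git/config` and picked up by later `git gc` runs.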
### *Sunsetting*
The tool hasn't seen much use (to my knowledge) – apart from web crawlers
of search engines. I've heard few complaints, if any, over the years
whenever it got stuck for long periods at a time.
Last week, I noticed it once again got stuck, and apparently had been for
several months. Rather than fixing it, I decided to shut it down this time.
The source code is available on GitHub for anyone interested in picking it
up again.[3] My recommendation would be to *not* try to maintain a local
Git repository like I did. Instead, have everything be ephemeral. That is,
whenever you run the script, for each branch you're interested in, create a
temporary clone with limited depth and just that branch, create the
archive, and then get rid of the clone (also do this before beginning, in
case something was left behind). This will make it a bit slower, and less
elegant, but
presumably much more stable. Actually, given how slow branch switches can
be, it might even be faster! There is also support in newer versions of Git
to invoke git-archive directly on a remote URL, which would remove the need
for local clones entirely. [6]
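The ephemeral approach could be sketched roughly as follows, in Python for illustration. This is not code from the tool: the function names, the `work_dir` parameter and the use of Git's built-in `tar.gz` archive format are all assumptions of this example. The `--remote` variant only works if the server allows `git archive` requests, which many hosts do not.

```python
import shutil
import subprocess
from pathlib import Path


def clone_command(repo_url, branch, dest):
    """Shallow clone of a single branch into dest."""
    return ["git", "clone", "--quiet", "--depth", "1", "--single-branch",
            "--branch", branch, repo_url, str(dest)]


def archive_command(clone_dir, out_file):
    """Archive HEAD of the clone; out_file should be an absolute path."""
    return ["git", "-C", str(clone_dir), "archive", "--format=tar.gz",
            "-o", str(out_file), "HEAD"]


def remote_archive_command(repo_url, branch, out_file):
    """Alternative without any clone, if the server permits it (see [6])."""
    return ["git", "archive", "--remote=" + repo_url, "--format=tar.gz",
            "-o", str(out_file), branch]


def snapshot_branch_ephemeral(repo_url, branch, out_file, work_dir):
    """Clone, archive, and always throw the clone away again."""
    work_dir = Path(work_dir)
    # Remove leftovers from an earlier run that was killed midway.
    shutil.rmtree(work_dir, ignore_errors=True)
    try:
        subprocess.run(clone_command(repo_url, branch, work_dir), check=True)
        subprocess.run(archive_command(work_dir, Path(out_file).resolve()),
                       check=True)
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```

Because nothing survives between runs, there is no index lock, stale untracked file or half-finished checkout to recover from; the cost is one shallow fetch per branch per run.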
### *Recipe* for (mostly stable) creation of tarballs from Git
My final hourly recipe looked like this:
1. Blindly delete any ".git/index.lock" file.
2. Run "git clean -q -d -x -f", which deletes unknown (untracked and ignored) files, with extra force.
3. Run "git reset -q --hard", deletes any locally staged state, with extra
force. (While nothing does any staging in this script, Git would sometimes
magically think a file was staged. If I recall correctly, this related to
extension submodules.)
4. Get name of remote.
5. Run "git fetch origin".
6. Run "git remote prune origin".
7. Get list of branches (then filter by pattern). Then, for each branch:
8. Get $head of tree for branch via "git rev-parse --verify $branch".
9. Check if you've already got an archive for that. If so, continue with
the next branch instead. If not, go on:
10. Repeat steps 1-3 to reset the repo. [4]
11. Run "git checkout -q -f $branch", checks out the branch, with extra
force.
12. Run "git rev-parse --verify HEAD", and confirm it matches $head,
because sometimes the checkout command reported success without actually
switching to the branch.
13. Run "git archive HEAD --format='tar' | gzip >
mediawiki-$branch-$head.tar.gz", which creates the actual archive.
14. If this is the "master" branch, update the Mediawiki-latest.tar.gz symlink.
15. (End of for-each branch). Delete older tar files for branches that we
created a new one for just now.
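The reset (steps 1-3) and the per-branch part (steps 8-13) of the recipe above can be sketched in Python as follows. This is an illustration under assumptions, not the tool's PHP code: the helper names are made up, branches are assumed to be addressed as "origin/<branch>", slashes in branch names are assumed to be flattened for the file name, Git's built-in "tar.gz" format stands in for the pipe through gzip, and writing to a ".tmp" file first is an extra precaution of this sketch. Fetching, branch listing, the symlink and cleanup are left out.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run a git command inside repo and return its trimmed stdout."""
    result = subprocess.run(["git", "-C", str(repo), *args],
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()


def reset_repo(repo):
    """Steps 1-3: drop a stale lock, unknown files and any staged state."""
    (Path(repo) / ".git" / "index.lock").unlink(missing_ok=True)  # Python 3.8+
    git(repo, "clean", "-q", "-d", "-x", "-f")
    git(repo, "reset", "-q", "--hard")


def archive_name(branch, head):
    """File name for a snapshot (assumption: '/' in branch names becomes '-')."""
    return "mediawiki-{}-{}.tar.gz".format(branch.replace("/", "-"), head)


def snapshot_branch(repo, branch, out_dir):
    """Steps 8-13 for one branch. Returns the new archive path, or None."""
    ref = "origin/" + branch  # assumption: remote-tracking branch
    head = git(repo, "rev-parse", "--verify", ref)            # step 8
    target = Path(out_dir).resolve() / archive_name(branch, head)
    if target.exists():                                       # step 9
        return None
    reset_repo(repo)                                          # step 10
    git(repo, "checkout", "-q", "-f", ref)                    # step 11
    if git(repo, "rev-parse", "--verify", "HEAD") != head:    # step 12
        raise RuntimeError("checkout of " + ref + " did not take effect")
    tmp = target.with_name(target.name + ".tmp")
    git(repo, "archive", "--format=tar.gz", "-o", str(tmp), "HEAD")  # step 13
    tmp.replace(target)  # only a complete archive gets the final name
    return target
```

A caller would run the fetch and prune once, then call `snapshot_branch()` for each branch that matches the pattern, and afterwards delete older archives for the branches where it returned a path.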
The tool lived at https://tools.wmflabs.org/snapshots, which now redirects
to https://www.mediawiki.org/wiki/Snapshots instead.
Best,
-- Krinkle
[1] Toolserver. –
https://en.wikipedia.org/wiki/Wikipedia:Toolserver
[2] Toolforge Grid. –
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid
[3] The script. –
https://github.com/Krinkle/mw-tool-snapshots/blob/b33d479cb9/scripts/update…
[4] The reset. –
https://github.com/Krinkle/toollabs-base/blob/v1.0.2/src/GlobalFunctions.ph…
[5] More about Git GC and memory configuration management. –
https://github.com/Krinkle/mw-tool-snapshots#git-memory
[6] Use git-archive on a remote repo, without a local clone. –
https://git-scm.com/docs/git-archive/2.18.0#git-archive---remoteltrepogt