TL;DR: I've decided to sunset my "snapshots" tool.
The Snapshots tool created TAR-archives of MediaWiki core branches, fresh from Gerrit, every hour.
I created it in 2012 on the Toolserver,[1] to make it easier for site admins to try out the alpha version of MediaWiki (or a WMF branch), using the same format as our official stable releases. The snapshots were generated using a PHP script and git-cli commands, scheduled with a cronjob onto the Toolforge Grid. [2] [3]
### Lessons learned
Maintaining this tool has mostly been an exercise in learning how hard it is to keep a Git repository functional over a long period of time.
I learned about the numerous ways that a Git repository can become corrupted or unusable when commands are terminated in unforeseen ways. For example, what happens to a clone of MediaWiki core that lives on NFS when you switch it from current master to a release branch from a decade ago, while lots of other users are also working on that same NFS mount, and you do this for every branch, every hour? [4]
I learned how poor Git can be at cleaning up files that belong to one branch but not to another. Even when there aren't any errors, if you switch to an old branch with directories or submodules that a newer branch doesn't have and then switch back, the old files can stay behind and show up as untracked files. These then cause checkouts to fail later on due to conflicting changes, when another branch does have the file in question. (This improved a lot around 2015 with later releases of Git 2.x.)
I learned that, apparently, when Git's garbage collector kicks in from time to time, it doesn't know how much memory it is allowed to use. It will "efficiently" use larger and larger blocks to speed up the process until it gets killed by the grid engine. It then eventually starts again and makes the same mistake, until someone comes in and manually runs git-gc outside the grid. (I initially worked around this by disabling git-gc. I later found a way to re-enable it with more constrained settings, see [5].)
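
To make the git-gc workaround more concrete, here is a minimal shell sketch of what constraining Git's packing memory could look like. The configuration keys are regular Git settings, but the function names (`constrain_git_memory`, `disable_auto_gc`) and every value are assumptions picked for illustration. They are not the settings the tool actually used; those are documented at [5].

```shell
# Apply conservative packing limits to the repository at $1.
# All values below are illustrative assumptions, not the tool's settings.
constrain_git_memory() {
  git -C "$1" config pack.threads 1           # a single delta-search thread
  git -C "$1" config pack.windowMemory 100m   # cap delta window memory per thread
  git -C "$1" config pack.deltaCacheSize 50m  # cap the cache of computed deltas
  git -C "$1" config pack.packSizeLimit 200m  # split packs larger than this
}

# The earlier workaround: turn automatic garbage collection off entirely.
disable_auto_gc() {
  git -C "$1" config gc.auto 0
}
```

Calling `constrain_git_memory /path/to/clone` once is enough, because the values are stored in that repository's own config. Suitable limits depend on how much memory the job scheduler grants.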
### Sunsetting
To my knowledge, the tool hasn't seen much use, apart from the web crawlers of search engines. I haven't heard many complaints over the years (if any) when it got stuck for long periods at a time.
Last week, I noticed it once again got stuck, and apparently had been for several months. Rather than fixing it, I decided to shut it down this time. The source code is available on GitHub for anyone interested in picking it up again.[3]

My recommendation would be to *not* try to maintain a local Git repository like I did. Instead, have everything be ephemeral. That is, whenever you run the script, for each branch you're interested in, create a temporary clone with limited depth and just that branch, create the archive, and then get rid of the clone (also do this before beginning, in case something was left behind). This will make it a bit slower, and less elegant, but presumably much more stable. Actually, given how slow branch switches can be, it might even be faster!

There is also support in newer versions of Git to invoke git-archive directly on a remote URL, which would remove the need for local clones entirely. [6]
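
The ephemeral approach could be sketched roughly as follows. This is not code from the tool, only a minimal illustration under a few assumptions: the function names (`archive_name`, `snapshot_branch`, `snapshot_branch_remote`) are hypothetical, slashes in branch names are flattened to dashes for the file name, and the archive is gzipped by git-archive itself rather than piped through gzip. The second function uses `git archive --remote` [6], which only works if the server permits it, so that is something to verify for any given remote.

```shell
# Hypothetical helper: archive file name for a branch at a commit.
# Slashes in branch names (e.g. "wmf/...") are replaced with dashes.
archive_name() {
  printf 'mediawiki-%s-%s.tar.gz\n' "$(printf '%s' "$1" | tr '/' '-')" "$2"
}

# Usage: snapshot_branch REMOTE_URL BRANCH OUT_DIR
# Archives one branch from a throwaway shallow clone.
snapshot_branch() (
  set -eu
  remote=$1 branch=$2 outdir=$3

  # Ask the remote for the branch head first, so that an archive which
  # already exists can be skipped without cloning anything.
  head=$(git ls-remote "$remote" "refs/heads/$branch" | cut -f1)
  [ -n "$head" ] || { echo "no such branch: $branch" >&2; exit 1; }
  out="$outdir/$(archive_name "$branch" "$head")"
  [ ! -e "$out" ] || exit 0

  # A fresh temporary directory; the trap removes it however we exit.
  tmp=$(mktemp -d)
  trap 'rm -rf "$tmp"' EXIT

  git clone --quiet --depth 1 --single-branch --branch "$branch" \
    "$remote" "$tmp/clone"

  # The branch may have moved between ls-remote and clone, so name the
  # archive after the commit that was actually cloned.
  head=$(git -C "$tmp/clone" rev-parse --verify HEAD)
  out="$outdir/$(archive_name "$branch" "$head")"

  # Write under a temporary name and rename afterwards, so a half-written
  # file is never mistaken for a finished archive.
  git -C "$tmp/clone" archive --format=tar.gz --output="$out.part" HEAD
  mv "$out.part" "$out"
)

# Variant without any clone. If the branch moves between the two remote
# calls, the file name will not match the archived commit.
snapshot_branch_remote() (
  set -eu
  remote=$1 branch=$2 outdir=$3
  head=$(git ls-remote "$remote" "refs/heads/$branch" | cut -f1)
  [ -n "$head" ] || { echo "no such branch: $branch" >&2; exit 1; }
  out="$outdir/$(archive_name "$branch" "$head")"
  [ ! -e "$out" ] || exit 0
  git archive --remote="$remote" --format=tar.gz --output="$out.part" "$branch"
  mv "$out.part" "$out"
)
```

A run would then be a loop over the branches of interest, calling for example `snapshot_branch "$url" master "$outdir"`, where `$url` and `$outdir` stand in for the repository URL and the output directory. Because nothing persists between runs except the finished archives, a killed job can leave behind at most a stray temporary directory or `.part` file.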
### Recipe for (mostly stable) creation of tarballs from Git
My final hourly recipe looked like this:
1. Blindly delete any ".git/index.lock" file.
2. Run "git clean -q -d -x -f", which deletes unknown files, with extra force.
3. Run "git reset -q --hard", which deletes any locally staged state, with extra force. (While nothing does any staging in this script, Git would sometimes magically think a file was staged. If I recall correctly, this was related to extension submodules.)
4. Get name of remote.
5. Run "git fetch origin".
6. Run "git remote prune origin".
7. Get list of branches (then filter by pattern).

Then, for each branch:

8. Get $head of tree for branch via "git rev-parse --verify $branch".
9. Check if you've already got an archive for that. If so, continue with the next branch instead. If not, go on:
10. Repeat steps 1-3 to reset the repo. [4]
11. Run "git checkout -q -f $branch", which checks out the branch, with extra force.
12. Run "git rev-parse --verify HEAD", and confirm it matches $head, because sometimes the checkout command succeeded, but not really.
13. Run "git archive HEAD --format='tar' | gzip > mediawiki-$branch-$head.tar.gz", which creates the actual archive.
14. If for the "master" branch, update the Mediawiki-latest.tar.gz symlink.
15. (End of the for-each loop.) Delete older tar files for branches that we created a new one for just now.
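
Put together, the steps above could look roughly like the shell sketch below. It is a reconstruction for illustration, not the tool's PHP script [3]. The function names (`archive_name`, `reset_repo`, `snapshot_all`) are hypothetical, and it makes a few assumptions of its own: the remote is hard-coded as "origin" instead of being looked up (step 4), branches are checked out as "origin/$branch" on a detached HEAD, slashes in branch names are flattened to dashes, the archive is gzipped by git-archive itself, and old archives are deleted inside the loop rather than after it.

```shell
# Hypothetical helper: archive file name for a branch at a commit.
# Slashes in branch names (e.g. "wmf/...") are replaced with dashes.
archive_name() {
  printf 'mediawiki-%s-%s.tar.gz\n' "$(printf '%s' "$1" | tr '/' '-')" "$2"
}

# Steps 1-3: force the repository in the current directory into a
# clean state, whatever a previously killed command left behind.
reset_repo() {
  rm -f .git/index.lock
  git clean -q -d -x -f
  git reset -q --hard
}

# Usage: snapshot_all REPO_DIR OUT_DIR PATTERN
# OUT_DIR must be an absolute path; PATTERN is an extended regex.
snapshot_all() (
  repo=$1 outdir=$2 pattern=$3
  cd "$repo" || exit 1
  reset_repo
  git fetch -q origin || exit 1          # step 5
  git remote prune origin                # step 6

  # Step 7: list remote branches, drop the symbolic HEAD, then filter.
  git for-each-ref --format='%(refname)' refs/remotes/origin |
  sed -n 's|^refs/remotes/origin/||p' | grep -v '^HEAD$' |
  grep -E -- "$pattern" |
  while read -r branch; do
    head=$(git rev-parse --verify "origin/$branch") || continue   # step 8
    out="$outdir/$(archive_name "$branch" "$head")"
    [ -e "$out" ] && continue                                     # step 9

    reset_repo                                                    # step 10
    git checkout -q -f "origin/$branch" || continue               # step 11
    # Step 12: confirm the checkout really landed on $head.
    if [ "$(git rev-parse --verify HEAD)" != "$head" ]; then
      echo "checkout of $branch did not reach $head" >&2
      continue
    fi

    # Step 13: let git-archive gzip the tarball itself, so a failed
    # archive is not hidden behind gzip's exit status, and rename the
    # file into place only once it is complete.
    git archive --format=tar.gz --output="$out.part" HEAD || continue
    mv "$out.part" "$out"

    # Step 14: point the "latest" symlink at the new master archive.
    if [ "$branch" = master ]; then
      ln -sf "${out##*/}" "$outdir/Mediawiki-latest.tar.gz"
    fi

    # Step 15: delete older archives of this branch. The suffix check
    # avoids touching "foo-bar" archives while cleaning up "foo".
    prefix=${out%"$head.tar.gz"}
    for old in "$prefix"*.tar.gz; do
      sha=${old#"$prefix"}; sha=${sha%.tar.gz}
      case $sha in *[!0-9a-f]*) continue ;; esac
      [ "$old" = "$out" ] || rm -f "$old"
    done
  done
)
```

An hourly job would call something like `snapshot_all /path/to/core /path/to/output '^(master|REL1_)'`, where both paths and the pattern are placeholders. Failures for one branch are skipped with `continue`, so a single bad checkout does not stop the other branches from being archived.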
The tool lived at https://tools.wmflabs.org/snapshots, which now redirects to https://www.mediawiki.org/wiki/Snapshots instead.
Best,
-- Krinkle
[1] Toolserver. – https://en.wikipedia.org/wiki/Wikipedia:Toolserver
[2] Toolforge Grid. – https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid
[3] The script. – https://github.com/Krinkle/mw-tool-snapshots/blob/b33d479cb9/scripts/updateS...
[4] The reset. – https://github.com/Krinkle/toollabs-base/blob/v1.0.2/src/GlobalFunctions.php...
[5] More about Git GC and memory configuration management. – https://github.com/Krinkle/mw-tool-snapshots#git-memory
[6] Use git-archive on a remote repo, without a local clone. – https://git-scm.com/docs/git-archive/2.18.0#git-archive---remoteltrepogt