Git, Gerrit and the coming migration - Wikitech-l

6 Mar 2012


Hi all,
Some disclaimers before I start my thread:
1) I am a big believer in Git and dvcs and I think this is the right decision
2) I am a big believer in Gerrit and code-review and I think this is
the right decision
3) I might be wholly unaware of, or inaccurate about, certain things;
apologies in advance.
4) A BIIGG thank you to all the folks involved in preparing this
migration (evaluation, migration and training): in particular Chad,
Sumanah and Roan (but I am sure more people are involved and I am just
blissfully unaware).
My main worry is that we are not spending enough time on getting all
engineers (both internal and in the community) up to speed with the
coming migration to Git and Gerrit and that we are going to blame the
tools (Gerrit and/or Git) instead of the complex interaction between
three changes. We are making three fundamental changes in one shot:
1) Migrating from a centralized source control system to a
decentralized system (SVN -> Git)
2) Introducing a new dedicated code-review tool (Gerrit)
3) Introducing a gated-trunk model
My concern is not about the UI of Gerrit. I know it's popular within
WMF to say that its UI sucks, but I don't think that's the case, and
even if it were an issue it would only be a minor one. People have
already suggested that we might consider other code-review systems; I
did a quick Google search and we are the only community considering
migrating from Gerrit to Phabricator. I think this is beside the
point: the real challenge is moving to a gated-trunk model, regardless
of the chosen code-review tool. I cannot imagine that other
code-review tools that are also based on a gated-trunk model and work
with Git are much easier than Gerrit. The complexity comes from the
gated-trunk model, not from the tool.
The gated-trunk model means that, when you clone or pull from master,
it might be the case that files relevant to you have been changed but
that those new changes are waiting to be merged (the pull request
backlog, AKA the code-review backlog). In the always-commit world with
no gatekeeping between developers and master, this never happens; your
local copy can always be fully synchronized with trunk ("master").
Even if a commit is reverted, your local working copy will still have
it, and you can still commit any changes that you based on this
reverted commit. Obviously people get annoyed when you keep checking
in reverted code, but it won't break anything.
In an ideal world, our code-review backlog would be zero commits at
any time of the day; if that's the case then 'master' is always
up-to-date and you have the same situation as with the 'always-commit'
model. However, we know that the code-review backlog is a fact and
it's the intersection of Git, Gerrit and the backlog that is going to
be painful.
Suppose I clone master, but there are 10 commits waiting to be
reviewed with files that are relevant to me. I am happily coding in my
own local branch and after a while I am ready to commit. Meanwhile,
those 10 commits have been reviewed and merged, and now when I want to
merge my branch back to master I get merge conflicts. I discover these
merge conflicts either when my branch is merged back to master or when
I pull midway to update my local branch.
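To make the scenario concrete, here is a minimal sketch of how one
could spot the risk early: given the files I touched locally and the
files touched by each change still waiting in review, list the pending
changes that overlap with my work. The function name, the change ids
and the file names are all made up for illustration; nothing here
talks to Gerrit or Git.

```python
def overlapping_changes(my_files, pending):
    """Return {change_id: shared files} for pending changes that touch
    at least one file I have also modified locally."""
    mine = set(my_files)
    overlaps = {}
    for change_id, files in pending.items():
        shared = mine & set(files)
        if shared:
            overlaps[change_id] = sorted(shared)
    return overlaps


# Hypothetical data: files I edited in my local branch ...
my_files = ["core.php", "ui.php"]

# ... and the files touched by changes still in the review backlog.
pending = {
    "change-1": ["core.php", "api.php"],
    "change-2": ["docs.txt"],
    "change-3": ["ui.php"],
}

risky = overlapping_changes(my_files, pending)
```

Every change that shows up in `risky` is a potential merge conflict
once it gets merged ahead of my own work.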
To be a productive engineer after the migration it will *not* be
sufficient if you have only mastered git clone, git pull, git push,
git add and git commit commands. These are the basic git commands.
Two overall recommendations:
1) The Git / Gerrit combination means that you will have to understand
git rebase, git commit --amend, git bisect and git cherry-pick. This
is advanced Git usage and it will make the learning curve steeper. I
think we need to spend more time on training. I have been looking for
good tutorials about Git & Gerrit in practice and I haven't been able
to find any, but maybe other people have better Google-fu (I think
we are looking for advanced tutorials that cover not just cloning and
pulling, but also merging, bisect and cherry-pick).
2) We need to come up with a smarter way of determining how to approach
the code-review backlog. Three overall strategies come to mind:
a) random, just pick a commit
b) time-based picking (either the oldest or the youngest commit)
c) 'impact' of commit
a) and b) do not require any extra tooling but are less suited to a
gated-trunk model. Option c) could be something where we construct a
graph of the codebase, determine the most central files (hubs), and
sort commits by the centrality of the files they touch. The graph only
needs to be reconstructed after major refactoring or every month or
so. Obviously, this requires a bit of coding and I don't have formal
proof that this actually will reduce the pain but I am hopeful. If
constructing a graph is too cumbersome then we can sort by the number
of affected files in a commit as a proxy. If we cannot come up with a
c) strategy then the only real option is to make sure that the queue
is as short as possible.
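A rough sketch of what option c) could look like, under some loud
assumptions: the dependency graph is given as a plain list of
(file, file) edges, "centrality" is simply the number of edges touching
a file (degree), and the impact of a commit is the sum of the
centrality of its files. All names and data below are invented for
illustration, and a real implementation might well prefer a different
centrality measure.

```python
from collections import defaultdict


def file_centrality(dependencies):
    """Degree centrality: number of dependency edges touching each file."""
    degree = defaultdict(int)
    for src, dst in dependencies:
        degree[src] += 1
        degree[dst] += 1
    return dict(degree)


def rank_by_centrality(pending, centrality):
    """Order the review queue so commits touching hub files come first."""
    def impact(change):
        return sum(centrality.get(f, 0) for f in change["files"])
    return sorted(pending, key=impact, reverse=True)


def rank_by_file_count(pending):
    """Cheaper proxy: order by the number of files a commit affects."""
    return sorted(pending, key=lambda c: len(c["files"]), reverse=True)


# Hypothetical dependency edges between files in the codebase.
dependencies = [
    ("api.php", "core.php"),
    ("ui.php", "core.php"),
    ("util.php", "core.php"),
    ("skin.php", "ui.php"),
]

# Hypothetical review backlog.
pending = [
    {"id": "change-1", "files": ["core.php", "ui.php"]},
    {"id": "change-2", "files": ["skin.php", "util.php", "api.php"]},
    {"id": "change-3", "files": ["util.php"]},
]

centrality = file_centrality(dependencies)
by_centrality = [c["id"] for c in rank_by_centrality(pending, centrality)]
by_file_count = [c["id"] for c in rank_by_file_count(pending)]
```

With this made-up data the two orderings disagree: the centrality
ranking puts change-1 first because it touches the hub file, while the
file-count proxy puts change-2 first because it touches more files.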
Best,
Diederik