On Tue, Mar 22, 2011 at 10:46 PM, Tim Starling tstarling@wikimedia.org wrote:
> The tone is quite different to one of the first things I read about Mercurial:
> "Oops! Mercurial cut off your arm!
> "Don't randomly try stuff to see if it'll magically fix it. Remember what you stand to lose, and set down the chainsaw while you still have one good arm."
My experience with Mercurial is that if you type the wrong commands, it likes to destroy data. For instance, once when I did an hg up with conflicts, it opened up some kind of three-way diff in vim that I had no idea how to use, so I exited. This resulted in my working copy (or parts of it) being lost, since apparently it defaulted to assuming that I was okay with whatever default merging it had done, and threw out the rest. I also once lost commits under similar circumstances when doing hg rebase. I'm pretty sure you can configure it to be safer, but it's one of the major reasons I dislike Mercurial. (I was able to recover my lost data from filesystem backups.)
git, on the other hand, never destroys committed data. Barring bugs (which I don't recall ever running into), the only command that destroys data is git gc, and that normally only destroys things that have been unreachable for a number of days. If you do a rebase, for instance, the old commits are no longer visible from normal commands like "git log", but they'll stick around for some period of time, so you can recover them if needed (although the process is a bit arcane if you don't know the commit IDs). There are also no git commands I've run into that will do anything nasty to your working copy without asking you, except obvious ones like git reset --hard. In the event of update conflicts, for instance, git adds conflict markers just like Subversion.
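For the record, recovering dropped commits after a rebase looks roughly like this (the branch name and commit ID are made up for illustration):

  # The reflog records where HEAD and each branch used to point, even
  # after a rebase has moved the branch elsewhere:
  git reflog show mybranch
  # Say the entry from just before the rebase is abc1234; putting a
  # branch on it makes it reachable again, so git gc will never touch it:
  git branch mybranch-backup abc1234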
> The main argument is that merging is easy so you can branch without the slightest worry. I think this is an exaggeration. Interfaces change, and when they change, developers change all the references to those interfaces in the code which they can see in their working copy. The greater the time difference in the branch points, the more likely it is that your new code will stop working. As the branch point gap grows, merging becomes more a task of understanding the interface changes and rewriting the code than just repeating the edits and copying in the new code.
> I'm not talking about the interfaces between core and extensions, which are reasonably stable. I'm mainly talking about the interfaces which operate within and between core modules. These change all the time. The problem of changing interfaces is most severe when developers are working on different features within the same region of core code.
> Doing regular reintegration merges from trunk to development branches doesn't help, it just means that you get the interface changes one at a time, instead of in batches.
> Having a short path to trunk means that the maximum amount of code is visible to the developers who are doing the interface changes, so it avoids the duplication of effort that occurs when branch maintainers have to understand and account for every interface change that comes through.
In practice, this is generally not true. Realistically, most patches change a relatively small amount of code and don't cause merge conflicts even if you keep them out of trunk for quite a long time. For instance, I maintain dozens of patches to the proprietary forum software vBulletin for the website I run. I store them all in git, and to upgrade I do a git rebase. Even on a major version upgrade, I only have to update a few of the patches, and the updates are small and can be done mindlessly. It's really very little effort. Even a commit that touches a huge amount of code (like my conversion of named entity references to numeric) will only conflict with a small percentage of patches.
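To give a sense of the workflow (the branch names here are just illustrative):

  # "vendor" holds pristine vBulletin releases; "live" holds my patches
  # on top of the latest release.
  git checkout vendor
  # ...unpack the new vBulletin release over the working tree...
  git add -A
  git commit -m "Import new vBulletin release"
  # Replay my patches on top of the new release; only the few that
  # actually conflict need any hand-editing:
  git checkout live
  git rebase vendor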
Of course, you have to be more careful with changing interfaces around when people use branches a lot. But in practice, you spend very little of your time resolving merge conflicts, relative to doing actual development work. It's not a significant disadvantage in practice. Experienced Subversion users just expect it to be, since merging in Subversion is horrible and they assume that's how it has to be. (Disclaimer: merges in Subversion are evidently so horrible that I never actually learned how to do them, so I can't give a good breakdown of why exactly DVCS merging is so much better. I can just say that I've never found it to be a problem at all while using a DVCS, but everyone complains about it with Subversion.)
I mean, the DVCS model was popularized by the Linux kernel. It's hard to think of individual codebases that large, or with that much developer activity. In recent years it's over 9,000 commits per release changing several hundred thousand lines of code, which works out to several thousand LOC changed a day. But merging is not a big problem for them -- they spend their time doing development, not wrestling with version control.
> If we split up the extensions directory, each extension having its own repository, then this will discourage developers from updating the extensions in bulk. This affects both interface changes and general code maintenance. I'm sure translatewiki.net can set up a script to do the necessary 400 commits per day, but I'm not sure if every developer who wants to fix unused variables or change a core/extension interface will want to do the same.
I've thought about this a bit. We want bulk code changes to extensions to be easy, but it would also be nice if it were easier to host extensions "officially" to get translations, distribution, and help from established developers. We also don't want anyone to have to check out all extensions just to get at trunk. Localization, on the other hand, is entirely separate from development, and has very different needs -- it doesn't need code review, and someone looking at the revision history for the whole repository doesn't want to see localization updates. (Especially in extensions, where often you have to scroll through pages of l10n updates to get to the code changes.)
Unfortunately, git's submodule feature is pretty crippled. It basically works like SVN externals, as I understand it: the larger repository just has markers saying where the submodules are, but their actual history is entirely separate. We could probably write a script to commit changes to all extensions at once, but it's certainly a less ideal solution.
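For instance, assuming a layout where every extension is a submodule under extensions/, such a script could boil down to something like this -- workable, but clearly clumsier than a single svn commit across the whole tree:

  # Commit the change in each extension's own repository ("|| true"
  # because foreach otherwise stops at the first submodule that has
  # nothing to commit):
  git submodule foreach 'git commit -a -m "Update for new hook signature" || true'
  # Then record the new submodule pointers in the umbrella repository:
  git commit -a -m "Update for new hook signature"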
If we moved to git, I'd tentatively say something like:

* Separate out the version control of localization entirely. Translations are already coordinated centrally on translatewiki.net, where the wiki itself maintains all the actual history and permissions, so the SVN checkin right now is really a needless formality that keeps translations less up-to-date and spams revision logs. Keep the English messages with the code in git, and have the other messages available for checkout in a different format via our own script. This checkout should always grab the latest translatewiki.net messages, without the need for periodic commits. (I assume translatewiki.net already does automatic syntax checks and so on.) Of course, the tarballs would package all languages.

* Keep the core code in one repository, each extension in a separate repository, and have an additional repository with all of them as submodules. Or maybe have extensions all be submodules of core (you can check out only a subset of submodules if you want -- see the sketch after this list).

* Developers who want to make mass changes to extensions are probably already doing them by script (at least I always do), so something like "for EXTENSION in extensions/*; do (cd $EXTENSION && git commit -a -m 'Boilerplate message'); done" shouldn't be an exceptional burden. If it comes up often enough, we can write a script to help out.

* We should take the opportunity to liberalize our policies for extension hosting. Anyone should be able to add an extension, and get commit access only to that extension. MediaWiki developers would get commit access to all hosted extensions, and hooking into our localization system should be as simple as making sure you have a properly-formatted ExtensionName.i18n.php file. If any human involvement is needed, it should only be basic sanity checks.

* Code review should migrate to an off-the-shelf tool like Gerrit. I don't think it's a good idea at all for us to reinvent the code-review wheel. To date we've done it poorly.
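The subset checkout mentioned in the second point would be something like this (the URL and extension names are hypothetical):

  # Cloning core doesn't fetch any submodules by default; pull in just
  # the extensions you actually want:
  git clone git://example.org/mediawiki/core.git
  cd core
  git submodule update --init extensions/Cite extensions/ParserFunctions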
This is all assuming that we retain our current basic development model, namely commit-then-review with a centrally-controlled group of people with commit access. One step at a time.
On Tue, Mar 22, 2011 at 11:16 PM, Tim Starling tstarling@wikimedia.org wrote:
> I think our focus at the moment should be on deployment of extensions and core features from the 1.17 branch to Wikimedia. We have heard on several occasions that it is the delay between code commit and deployment, and the difficulty in getting things deployed, which is disheartening for developers who come to us from the Wikimedia community. I'm not so concerned about the backlog of trunk reviews. We cleared it before, so we can clear it again.
I don't think moving to git will make code review very much easier in the short term. It would probably disrupt code review considerably, in fact, because people would have to get used to the new system. So I definitely think code review needs to be worked out before we overhaul anything. And that doesn't mean clearing out backlogs, it means not letting them accumulate in the first place -- like scaps once a month at the very minimum, and preferably at least once a week.
On Wed, Mar 23, 2011 at 2:51 PM, Diederik van Liere dvanliere@gmail.com wrote:
> The Python Community recently switched to a DVCS and they have documented their choice. It compares Git, Mercurial and Bzr and shows the pluses and minuses of each. In the end, they went for Mercurial.
> Choosing a distributed VCS for the Python project: http://www.python.org/dev/peps/pep-0374/
They gave three reasons:
1) git's Windows support isn't as good as Mercurial's. I don't know how much merit that has these days, so it bears investigation. I have the impression that the majority of MediaWiki developers use non-Windows platforms for development, so as long as it works well enough, I don't know if this should be a big deal.
2) Python developers preferred Mercurial when surveyed. Informally, I'm pretty certain that most MediaWiki developers with a preference prefer git.
3) Mercurial is written in Python, and Python developers want to use stuff written in Python. Not really relevant to us, even those of us who like Python a lot. :) (FWIW, despite being a big Python fan, I'm a bit perturbed that Mercurial often prints out a Python stack trace when it dies instead of a proper error message . . .)
GNOME also surveyed available options, and they decided to go with git: http://blogs.gnome.org/newren/2009/01/03/gnome-dvcs-survey-results/ Although of course, (1) would be a bit of a nonissue for them.
On Wed, Mar 23, 2011 at 3:41 PM, Rob Lanphier robla@wikimedia.org wrote:
> We will probably need to adopt some guidelines about the use of rebase assuming we move to Git.
I don't see why. Rebase can never be used on publicly-visible repositories -- anyone who tries to pull from the repo both before and after the rebase will get errors, since the current upstream HEAD is not a descendant of the old upstream HEAD. So rebasing is only relevant to what developers do in their own private branches, before they push them to the central public repository.
What we'd need is policies on *merging*. Do we encourage people to submit clean merges with an empty merge commit so the development history is preserved, or encourage rebasing so that the development history is linear and easier to analyze (e.g., bisect)? Whatever policies we adopt, people can always rebase in their private repos as much as they want, if they like. I guess we could discourage it, but I don't see why, as long as it doesn't cause bugs.
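Concretely, the two styles would look something like this (branch names hypothetical):

  # History-preserving style: always create a merge commit, even when a
  # fast-forward would be possible, so the feature branch stays visible:
  git checkout master
  git merge --no-ff my-feature

  # Linear style: rebase the feature branch onto the latest master, then
  # fast-forward master, so the history stays a straight line:
  git checkout my-feature
  git rebase master
  git checkout master
  git merge my-feature    # fast-forwards, no merge commit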
> We don't need to switch to the one extension per repo model right away, though. We could throw all of the extensions into a single repository at first, and then split it later if we run into this or other similar problems.
No, we can't. Any clone of the repository will have all history. If you want to split out extensions at a later date, you're not going to save much space, since they'll still be cloned with all the rest of the history. To really get rid of them, you'd have to create a whole new repository, forcing everyone to do a fresh clone and seriously hampering git's ability to merge any outstanding work from before you broke up the repo. If we want to split off some things into their own repos, the time to do that is when we switch to git, not afterward.
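For completeness, the way you'd actually extract an extension later is git filter-branch, which rewrites every commit and so effectively produces a brand-new repository anyway (repository and directory names hypothetical):

  # Start from a clone of the combined repo, then throw away everything
  # outside the one extension's directory, rewriting all commits:
  git clone mediawiki-extensions Cite-only
  cd Cite-only
  git filter-branch --prune-empty --subdirectory-filter Cite -- --all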