On Tue, Mar 22, 2011 at 10:46 PM, Tim Starling tstarling@wikimedia.org wrote:
> The tone is quite different to one of the first things I read about Mercurial:
> "Oops! Mercurial cut off your arm!
> "Don't randomly try stuff to see if it'll magically fix it. Remember what you stand to lose, and set down the chainsaw while you still have one good arm."
My experience with Mercurial is that if you type the wrong commands, it likes to destroy data. For instance, once when I did an hg up with conflicts, it opened up some kind of three-way diff in vim that I had no idea how to use, so I exited. This resulted in my working copy (or parts of it) being lost, since apparently it defaulted to assuming that I was okay with whatever default merging it had done, and threw out the rest. I also once lost commits under similar circumstances when doing hg rebase. I'm pretty sure you can configure it to be safer, but it's one of the major reasons I dislike Mercurial. (I was able to recover my lost data from filesystem backups.)
git, on the other hand, never destroys committed data. Barring bugs (which I don't recall ever running into), the only command that destroys data is git gc, and that normally only destroys things that have been unreachable for a number of days. If you do a rebase, for instance, the old commits are no longer visible from normal commands like "git log", but they'll stick around for some period of time, so you can recover them if needed (although the process is a bit arcane if you don't know the commit IDs). There are also no git commands I've run into that will do anything nasty to your working copy without asking you, except obvious ones like git reset --hard. In the event of update conflicts, for instance, git adds conflict markers just like Subversion.
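For the record, recovering dropped commits after a rebase looks roughly like this (the branch name and commit ID are made up for illustration):

  # The reflog records where HEAD and each branch used to point, even
  # after a rebase has moved the branch elsewhere:
  git reflog show mybranch
  # Say the entry from just before the rebase is abc1234; putting a
  # branch on it makes it reachable again, so git gc will never touch it:
  git branch mybranch-backup abc1234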
> The main argument is that merging is easy so you can branch without the slightest worry. I think this is an exaggeration. Interfaces change, and when they change, developers change all the references to those interfaces in the code which they can see in their working copy. The greater the time difference in the branch points, the more likely it is that your new code will stop working. As the branch point gap grows, merging becomes more a task of understanding the interface changes and rewriting the code than just repeating the edits and copying in the new code.
> I'm not talking about the interfaces between core and extensions, which are reasonably stable. I'm mainly talking about the interfaces which operate within and between core modules. These change all the time. The problem of changing interfaces is most severe when developers are working on different features within the same region of core code.
> Doing regular reintegration merges from trunk to development branches doesn't help, it just means that you get the interface changes one at a time, instead of in batches.
> Having a short path to trunk means that the maximum amount of code is visible to the developers who are doing the interface changes, so it avoids the duplication of effort that occurs when branch maintainers have to understand and account for every interface change that comes through.
In practice, this is generally not true. Realistically, most patches change a relatively small amount of code and don't cause merge conflicts even if you keep them out of trunk for quite a long time. For instance, I maintain dozens of patches to the proprietary forum software vBulletin for the website I run. I store them all in git, and to upgrade I do a git rebase. Even on a major version upgrade, I only have to update a few of the patches, and the updates are small and can be done mindlessly. It's really very little effort. Even a commit that touches a huge amount of code (like my conversion of named entity references to numeric) will only conflict with a small percentage of patches.
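To give a sense of the workflow (the branch names here are just illustrative):

  # "vendor" holds pristine vBulletin releases; "live" holds my patches
  # on top of the latest release.
  git checkout vendor
  # ...unpack the new vBulletin release over the working tree...
  git add -A
  git commit -m "Import new vBulletin release"
  # Replay my patches on top of the new release; only the few that
  # actually conflict need any hand-editing:
  git checkout live
  git rebase vendor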
Of course, you have to be more careful with changing interfaces around when people use branches a lot. But in practice, you spend very little of your time resolving merge conflicts, relative to doing actual development work. It's not a significant disadvantage in practice. Experienced Subversion users just expect it to be, since merging in Subversion is horrible and they assume that's how it has to be. (Disclaimer: merges in Subversion are evidently so horrible that I never actually learned how to do them, so I can't give a good breakdown of why exactly DVCS merging is so much better. I can just say that I've never found it to be a problem at all while using a DVCS, but everyone complains about it with Subversion.)
I mean, the DVCS model was popularized by the Linux kernel. It's hard to think of individual codebases that large, or with that much developer activity. In recent years it's over 9,000 commits per release changing several hundred thousand lines of code, which works out to several thousand LOC changed a day. But merging is not a big problem for them -- they spend their time doing development, not wrestling with version control.
> If we split up the extensions directory, each extension having its own repository, then this will discourage developers from updating the extensions in bulk. This affects both interface changes and general code maintenance. I'm sure translatewiki.net can set up a script to do the necessary 400 commits per day, but I'm not sure if every developer who wants to fix unused variables or change a core/extension interface will want to do the same.
I've thought about this a bit. We want bulk code changes to extensions to be easy, but it would also be nice if it were easier to host extensions "officially" to get translations, distribution, and help from established developers. We also don't want anyone to have to check out all extensions just to get at trunk. Localization, on the other hand, is entirely separate from development, and has very different needs -- it doesn't need code review, and someone looking at the revision history for the whole repository doesn't want to see localization updates. (Especially in extensions, where often you have to scroll through pages of l10n updates to get to the code changes.)
Unfortunately, git's submodule feature is pretty crippled. It basically works like SVN externals, as I understand it: the larger repository just has markers saying where the submodules are, but their actual history is entirely separate. We could probably write a script to commit changes to all extensions at once, but it's certainly a less ideal solution.
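For instance, assuming a layout where every extension is a submodule under extensions/, such a script could boil down to something like this -- workable, but clearly clumsier than a single svn commit across the whole tree:

  # Commit the change in each extension's own repository ("|| true"
  # because foreach otherwise stops at the first submodule that has
  # nothing to commit):
  git submodule foreach 'git commit -a -m "Update for new hook signature" || true'
  # Then record the new submodule pointers in the umbrella repository:
  git commit -a -m "Update for new hook signature"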
If we moved to git, I'd tentatively say something like:

* Separate out the version control of localization entirely. Translations are already coordinated centrally on translatewiki.net, where the wiki itself maintains all the actual history and permissions, so the SVN checkin right now is really a needless formality that keeps translations less up-to-date and spams revision logs. Keep the English messages with the code in git, and have the other messages available for checkout in a different format via our own script. This checkout should always grab the latest translatewiki.net messages, without the need for periodic commits. (I assume translatewiki.net already does automatic syntax checks and so on.) Of course, the tarballs would package all languages.

* Keep the core code in one repository, each extension in a separate repository, and have an additional repository with all of them as submodules. Or maybe have extensions all be submodules of core (you can check out only a subset of submodules if you want -- see the sketch after this list).

* Developers who want to make mass changes to extensions are probably already doing them by script (at least I always do), so something like "for EXTENSION in extensions/*; do (cd $EXTENSION && git commit -a -m 'Boilerplate message'); done" shouldn't be an exceptional burden. If it comes up often enough, we can write a script to help out.

* We should take the opportunity to liberalize our policies for extension hosting. Anyone should be able to add an extension, and get commit access only to that extension. MediaWiki developers would get commit access to all hosted extensions, and hooking into our localization system should be as simple as making sure you have a properly-formatted ExtensionName.i18n.php file. If any human involvement is needed, it should only be basic sanity checks.

* Code review should migrate to an off-the-shelf tool like Gerrit. I don't think it's a good idea at all for us to reinvent the code-review wheel. To date we've done it poorly.
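The subset checkout mentioned in the second point would be something like this (the URL and extension names are hypothetical):

  # Cloning core doesn't fetch any submodules by default; pull in just
  # the extensions you actually want:
  git clone git://example.org/mediawiki/core.git
  cd core
  git submodule update --init extensions/Cite extensions/ParserFunctions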
This is all assuming that we retain our current basic development model, namely commit-then-review with a centrally-controlled group of people with commit access. One step at a time.
On Tue, Mar 22, 2011 at 11:16 PM, Tim Starling tstarling@wikimedia.org wrote:
> I think our focus at the moment should be on deployment of extensions and core features from the 1.17 branch to Wikimedia. We have heard on several occasions that it is the delay between code commit and deployment, and the difficulty in getting things deployed, which is disheartening for developers who come to us from the Wikimedia community. I'm not so concerned about the backlog of trunk reviews. We cleared it before, so we can clear it again.
I don't think moving to git will make code review very much easier in the short term. It would probably disrupt code review considerably, in fact, because people would have to get used to the new system. So I definitely think code review needs to be worked out before we overhaul anything. And that doesn't mean clearing out backlogs, it means not letting them accumulate in the first place -- like scaps once a month at the very minimum, and preferably at least once a week.
On Wed, Mar 23, 2011 at 2:51 PM, Diederik van Liere dvanliere@gmail.com wrote:
> The Python Community recently switched to a DVCS and they have documented their choice. It compares Git, Mercurial and Bzr and shows the pluses and minuses of each. In the end, they went for Mercurial.
> Choosing a distributed VCS for the Python project: http://www.python.org/dev/peps/pep-0374/
They gave three reasons:
1) git's Windows support isn't as good as Mercurial's. I don't know how much merit that has these days, so it bears investigation. I have the impression that the majority of MediaWiki developers use non-Windows platforms for development, so as long as it works well enough, I don't know if this should be a big deal.
2) Python developers preferred Mercurial when surveyed. Informally, I'm pretty certain that most MediaWiki developers with a preference prefer git.
3) Mercurial is written in Python, and Python developers want to use stuff written in Python. Not really relevant to us, even those of us who like Python a lot. :) (FWIW, despite being a big Python fan, I'm a bit perturbed that Mercurial often prints out a Python stack trace when it dies instead of a proper error message . . .)
GNOME also surveyed available options, and they decided to go with git: http://blogs.gnome.org/newren/2009/01/03/gnome-dvcs-survey-results/ Although of course, (1) would be a bit of a nonissue for them.
On Wed, Mar 23, 2011 at 3:41 PM, Rob Lanphier robla@wikimedia.org wrote:
> We will probably need to adopt some guidelines about the use of rebase assuming we move to Git.
I don't see why. Rebase can never be used on publicly-visible repositories -- anyone who tries to pull from the repo both before and after the rebase will get errors, since the current upstream HEAD is not a descendant of the old upstream HEAD. So rebasing is only relevant to what developers do in their own private branches, before they push them to the central public repository.
What we'd need is policies on *merging*. Do we encourage people to submit clean merges with an empty merge commit so the development history is preserved, or encourage rebasing so that the development history is linear and easier to analyze (e.g., bisect)? Whatever policies we adopt, people can always rebase in their private repos as much as they want, if they like. I guess we could discourage it, but I don't see why, as long as it doesn't cause bugs.
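Concretely, the two styles would look something like this (branch names hypothetical):

  # History-preserving style: always create a merge commit, even when a
  # fast-forward would be possible, so the feature branch stays visible:
  git checkout master
  git merge --no-ff my-feature

  # Linear style: rebase the feature branch onto the latest master, then
  # fast-forward master, so the history stays a straight line:
  git checkout my-feature
  git rebase master
  git checkout master
  git merge my-feature    # fast-forwards, no merge commit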
> We don't need to switch to the one extension per repo model right away, though. We could throw all of the extensions into a single repository at first, and then split it later if we run into this or other similar problems.
No, we can't. Any clone of the repository will have all history. If you want to split out extensions at a later date, you're not going to save much space, since they'll still be cloned with all the rest of the history. To really get rid of them, you'd have to create a whole new repository, forcing everyone to do a fresh clone and seriously hampering git's ability to merge any outstanding work from before you broke up the repo. If we want to split off some things into their own repos, the time to do that is when we switch to git, not afterward.
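For completeness, the way you'd actually extract an extension later is git filter-branch, which rewrites every commit and so effectively produces a brand-new repository anyway (repository and directory names hypothetical):

  # Start from a clone of the combined repo, then throw away everything
  # outside the one extension's directory, rewriting all commits:
  git clone mediawiki-extensions Cite-only
  cd Cite-only
  git filter-branch --prune-empty --subdirectory-filter Cite -- --all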