All,
When we end up moving MW core to Phabricator I'd like us to jettison our history. The repo is large and clunky and not conducive to development. It's only going to grow in size unless we do something to cut back on the junk we're carrying around.
This is my ideal Phabby world:
mediawiki (no /core, that was always redundant) mediawiki/i18n (as submodule) mediawiki/historical (full history, previous + all mediawiki going forward)
If we jettison all our history we can get the repo size down to 30-35MB, which is very nice. Doing it on Gerrit isn't worthwhile because it'd basically break everything. We're gonna be breaking things with the move to Phab...it's then or never if we're going to do this.
Being able to stitch with the old history would be nice and I think might be doable with git-replace. If not, I still think it's worth discussing for developer and deployer productivity.
Thoughts?
-Chad
Please, no. I regularly use git blame and git annotate on core to figure out why certain features are the way they are. --scott
On Fri, May 30, 2014 at 7:34 PM, C. Scott Ananian cananian@wikimedia.org wrote:
Please, no. I regularly use git blame and git annotate on core to figure out why certain features are the way they are. --scott
git-blame should respect git-replace'd objects, so you could add the full-history version as a second remote and see the full history.
Again, this is all in theory.
-Chad
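To make the git-replace idea concrete, here is a minimal sketch run against throwaway repositories. It assumes a Git recent enough to have `git replace --graft`; the repo names (`historical`, `truncated`), file names, and commit contents are invented for the demonstration and say nothing about how mediawiki/core would actually be split.

```shell
#!/bin/sh
# Sketch only: stitch a truncated repo back onto its full history with git replace.
# Repo names and commit contents are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

# "historical": the full history, three commits.
git init -q historical
cd historical
for n in 1 2 3; do
    echo "line $n" >> file.txt
    git add file.txt
    git commit -q -m "commit $n"
done
old_parent=$(git rev-parse HEAD~1)   # the commit just before the cut-over point
cd ..

# "truncated": starts fresh from the content of commit 3, then moves on.
git init -q truncated
cp historical/file.txt truncated/file.txt
cd truncated
git add file.txt
git commit -q -m "import current tree (history dropped)"
echo "line 4" >> file.txt
git commit -q -a -m "commit 4"
root=$(git rev-list --max-parents=0 HEAD)
before=$(git rev-list --count HEAD)

# Opt in to the old history: fetch it, then graft the truncated root onto it.
git fetch -q ../historical
git replace --graft "$root" "$old_parent"
after=$(git rev-list --count HEAD)
raw=$(git --no-replace-objects rev-list --count HEAD)

echo "commits visible before graft: $before, after graft: $after, ignoring replacements: $raw"
git blame -s file.txt
```

The graft lives in `refs/replace/` in the local repository only, so each developer who wants the old history would have to set it up (or fetch the replace refs) themselves.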
On 2014-05-30, 7:25 PM, Chad wrote:
All,
When we end up moving MW core to Phabricator I'd like us to jettison our history. The repo is large and clunky and not conducive to development. It's only going to grow in size unless we do something to cut back on the junk we're carrying around.
This is my ideal Phabby world:
mediawiki (no /core, that was always redundant) mediawiki/i18n (as submodule) mediawiki/historical (full history, previous + all mediawiki going forward)
If we jettison all our history we can get the repo size down to 30-35MB, which is very nice. Doing it on Gerrit isn't worthwhile because it'd basically break everything. We're gonna be breaking things with the move to Phab...it's then or never if we're going to do this.
Being able to stitch with the old history would be nice and I think might be doable with git-replace. If not, I still think it's worth discussing for developer and deployer productivity.
Thoughts?
-Chad
Eliminating localization updates from repos is always nice; I hate it when they fill up a repo's history. However, using a submodule doesn't fix that; it just replaces i18n file commits with a submodule update commit. Personally I've always wanted to switch to JSON messages (^_^ yay we already did that), drop messages for all languages besides the canonical texts (en and qqq), then integrate the automatic fetching of messages for other languages into MediaWiki (tarball releases can be bundled with a snapshot of the data for intranets, etc...; ExtensionDistributor can do the same; and thanks to things like localization caches we won't even need to require filesystem write access to do this). Especially for extensions, the i18n commits completely drown out the code contributions.
However I don't really like the thought of dropping the history. We're using git, so switching to Phabricator shouldn't actually break anything (except custom things like `git review`). git {clone|fetch|pull} won't work from the old URL anymore, but all people have to do is `git remote set-url {remote} {new url}` or `git remote add {new remote} {new url}` and voila, they pick up right where they left off, this time with Phabricator backing git instead of Gerrit.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
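A small sketch of the remote switch described above, using two local bare repositories as stand-ins for the old and new hosts. The names `old-host.git`, `new-host.git`, and `dev` are hypothetical; a real migration would use the actual Gerrit and Phabricator URLs.

```shell
#!/bin/sh
# Sketch: repoint an existing clone at a new host without re-cloning.
# "old-host.git" and "new-host.git" are local stand-ins for the two servers.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

git init -q seed
cd seed
echo hello > README
git add README
git commit -q -m "initial commit"
cd ..
git clone -q --bare seed old-host.git
git clone -q --bare seed new-host.git

# A developer's existing checkout, cloned from the old host.
git clone -q old-host.git dev
cd dev
head_before=$(git rev-parse HEAD)

# Option 1: change where "origin" points (set-url takes the remote name first).
git remote set-url origin "$work/new-host.git"
git fetch -q origin
# Option 2 (alternative): keep the old remote and add a second one.
# git remote add phab "$work/new-host.git"

new_url=$(git config --get remote.origin.url)
head_after=$(git rev-parse HEAD)
echo "origin now points at $new_url"
```

Local branches and history are untouched by the switch; only the URL stored in the clone's configuration changes.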
On Fri, May 30, 2014 at 7:50 PM, Daniel Friesen daniel@nadir-seen-fire.com wrote:
Eliminating localization updates from repos is always nice; I hate it when they fill up a repo's history. However, using a submodule doesn't fix that; it just replaces i18n file commits with a submodule update commit.
I guess we see different problems then. I don't care about the commits themselves, just the amount of data they contain :)
Submodule updates are always going to be lighter-weight.
Personally I've always wanted to switch to JSON messages (^_^ yay we already did that), drop messages for all languages besides the canonical texts (en and qqq), then integrate the automatic fetching of messages for other languages into MediaWiki (tarball releases can be bundled with a snapshot of the data for intranets, etc...; ExtensionDistributor can do the same; and thanks to things like localization caches we won't even need to require filesystem write access to do this). Especially for extensions, the i18n commits completely drown out the code contributions.
This would also be ok to me.
However I don't really like the thought of dropping the history. We're using git, so switching to Phabricator shouldn't actually break anything (except custom things like `git review`). git {clone|fetch|pull} won't work from the old URL anymore, but all people have to do is `git remote set-url {remote} {new url}` or `git remote add {new remote} {new url}` and voila, they pick up right where they left off, this time with Phabricator backing git instead of Gerrit.
I know we can carry the history (and we should, for referencing), I'm wondering if we *should* keep the history on the repos that the average developer uses for writing patches and deploying.
(The deployment thing is just a nice benefit. I probably will do this in deployment regardless of what we do on the canonical repo)
I've yet to find a git repo out there that's as large as ours that doesn't ship large blobs around (which we don't). Some of this is due to the nasty blobs in our history. Some of this is due to the ever-increasing number of i18n commit blobs.
-Chad
Le 31/05/2014 04:25, Chad a écrit :
When we end up moving MW core to Phabricator I'd like us to jettison our history. The repo is large and clunky and not conducive to development. It's only going to grow in size unless we do something to cut back on the junk we're carrying around.
Hello,
My repacked copy of core is 270MB which is not that huge and I really like having the whole history for bisecting and blaming code.
What about attempting to slow down the rate of growth? The i18n messages can probably be split into another repository, or at least be updated only once in a while instead of on a daily basis.
We might also have some big objects floating around in the repo which we could potentially drop. IIRC we had a few .jar files committed by mistake in SVN, though we dropped them when migrating to git. There are probably some other big objects we could remove.
cheers,
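One way to look for such big objects is to list every blob reachable from any ref and sort by size. The sketch below does this in a throwaway repository; the file names and sizes are invented, and on a real repository only the pipeline in the middle would be needed.

```shell
#!/bin/sh
# Sketch: find the largest blobs anywhere in a repository's history.
# The demo repo and its files are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

git init -q demo
cd demo
echo "small" > small.txt
dd if=/dev/zero of=big.bin bs=1024 count=100 2>/dev/null
git add small.txt big.bin
git commit -q -m "add files, one of them large"
git rm -q big.bin
git commit -q -m "remove the large file (it still lives in history)"

# List all reachable objects, keep blobs, sort by size in bytes (largest first).
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort -k3,3nr |
    head -n 5)
echo "$largest"

top_path=$(echo "$largest" | head -n 1 | awk '{print $4}')
top_size=$(echo "$largest" | head -n 1 | awk '{print $3}')
```

Finding the objects is the easy part. Actually removing them from a published repository still means rewriting history.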
On Sat, May 31, 2014 at 1:05 AM, Antoine Musso hashar+wmf@free.fr wrote:
Le 31/05/2014 04:25, Chad a écrit :
When we end up moving MW core to Phabricator I'd like us to jettison our history. The repo is large and clunky and not conducive to development. It's only going to grow in size unless we do something to cut back on the junk we're carrying around.
Hello,
My repacked copy of core is 270MB which is not that huge and I really like having the whole history for bisecting and blaming code.
270MB is gigantic for a git repo.
What about attempting to slow down the rate of growth? The i18n messages can probably be split into another repository, or at least be updated only once in a while instead of on a daily basis.
A solution for containing the growth, yes. I'd +1 this along with Daniel F's idea earlier.
We might also have some big objects floating around in the repo which we could potentially drop. IIRC we had a few .jar files committed by mistake in SVN, though we dropped them when migrating to git. There are probably some other big objects we could remove.
How would we do that without rewriting history? Same problem.
-Chad
On 31 May 2014 16:08, Chad innocentkiller@gmail.com wrote:
270MB is gigantic for a git repo.
But it's not an issue /per se/. The issue is slow clones/slow pulls, not so much the 270MB on your hard drive. The slow clones/pulls can be improved by re-packing the git repository on the server side -- this helped significantly for the pywikibot repositories. I'm not sure if this has been attempted for mw/core yet.
On Sat, May 31, 2014 at 8:03 AM, Merlijn van Deen valhallasw@arctus.nl wrote:
On 31 May 2014 16:08, Chad innocentkiller@gmail.com wrote:
270MB is gigantic for a git repo.
But it's not an issue /per se/. The issue is slow clones/slow pulls, not so much the 270MB on your hard drive. The slow clones/pulls can be improved by re-packing the git repository on the server side -- this helped significantly for the pywikibot repositories. I'm not sure if this has been attempted for mw/core yet.
I do it weekly for core. It's the only thing keeping it from exploding to many hundreds of MB on the Gerrit box.
-Chad
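For readers unfamiliar with repacking, the sketch below shows the basic effect on a throwaway repository: loose objects are consolidated into a single pack. The exact commands and options used on the Gerrit server are not stated in this thread, so the `git repack -a -d` call here is an assumption chosen for illustration.

```shell
#!/bin/sh
# Sketch: consolidate loose objects into one pack and compare object counts.
# The demo repo is hypothetical; the server-side options actually used are unknown.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

git init -q demo
cd demo
for n in 1 2 3 4 5; do
    echo "line $n" >> file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# "count:" in the output of count-objects is the number of loose objects.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# -a: pack everything reachable into one pack; -d: drop redundant packs and loose copies.
git repack -a -d -q

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs_after=$(git count-objects -v | awk '/^packs:/ {print $2}')
git count-objects -v
```

`git gc --aggressive` is a heavier alternative that also recomputes deltas, at the cost of considerably more CPU time on a large repository.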
On 31/05/14 15:03, Merlijn van Deen wrote:
On 31 May 2014 16:08, Chad innocentkiller@gmail.com wrote:
270MB is gigantic for a git repo.
But it's not an issue /per se/. The issue is slow clones/slow pulls, not so much the 270MB on your hard drive. The slow clones/pulls can be improved by re-packing the git repository on the server side -- this helped significantly for the pywikibot repositories. I'm not sure if this has been attempted for mw/core yet.
Currently the slow clones can be almost completely avoided by cloning from the GitHub mirror, if you have a reasonably fast connection. But most folks are likely to want older history for blames and whatnot sooner or later anyway, so with a truncated repo the long download for the bulk of the data would still come up when they fetch the rest, just with a few extra steps.
What /does/ github do, do we know? Would that be useful/applicable? Is phabricator similar?
-I
On Sat, May 31, 2014 at 5:22 PM, Isarra Yos zhorishna@gmail.com wrote:
What /does/ github do, do we know? Would that be useful/applicable? Is phabricator similar?
They use way more than one server for their cluster, probably have caching. Plus all kinds of proprietary secret sauce including their own in-house implementation of Git.
Phabricator uses the normal system git. We won't be using jgit anymore like with Gerrit.
-Chad
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Why don't we just suggest that people use shallow clones? Git supports pushing from and pulling to them since 1.9, and while Gerrit doesn't accept pushes from them (or at least it didn't when I just tried), I see no reason why Phabricator would have any issues if it only works on diffs anyway, not commits.
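A minimal sketch of the shallow-clone workflow, run against a local throwaway repository (the `upstream` name and its three commits are invented). It only demonstrates cloning with truncated history and later fetching the rest; whether a given server accepts pushes from a shallow clone is not something this sketch tests.

```shell
#!/bin/sh
# Sketch: clone only the latest commit, then fetch the full history on demand.
# "upstream" is a hypothetical local stand-in for the real remote.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

git init -q upstream
cd upstream
for n in 1 2 3; do
    echo "line $n" >> file.txt
    git add file.txt
    git commit -q -m "commit $n"
done
cd ..

# A file:// URL is used because --depth is ignored for plain local-path clones.
git clone -q --depth 1 "file://$work/upstream" shallow
cd shallow
shallow_count=$(git rev-list --count HEAD)

# Later, when blame or bisect needs older commits, download the remainder.
git fetch -q --unshallow
full_count=$(git rev-list --count HEAD)

echo "commits in shallow clone: $shallow_count, after --unshallow: $full_count"
```

An intermediate option is `git fetch --depth N` to deepen the history step by step instead of fetching everything at once.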
On Sat, May 31, 2014 at 5:52 AM, Bartosz Dziewoński matma.rex@gmail.com wrote:
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Ok, fair enough. Everyone's made some really good points so let's drop the idea of dropping our history.
However I think we should continue to discuss ways to contain the repo size going forward. That, combined with some aggressive repacking and dropping of refs/changes/* (when we move to Phabricator) should help get it under control.
Why don't we just suggest that people use shallow clones? Git supports pushing from and pulling to them since 1.9, and while Gerrit doesn't accept pushes from them (or at least it didn't when I just tried), I see no reason why Phabricator would have any issues if it only works on diffs anyway, not commits.
This is also a good idea.
-Chad
On Sat, May 31, 2014 at 9:38 AM, Chad innocentkiller@gmail.com wrote:
On Sat, May 31, 2014 at 5:52 AM, Bartosz Dziewoński matma.rex@gmail.com wrote:
Why don't we just suggest that people use shallow clones? Git supports pushing from and pulling to them since 1.9, and while Gerrit doesn't accept pushes from them (or at least it didn't when I just tried), I see no reason why Phabricator would have any issues if it only works on diffs anyway, not commits.
This is also a good idea.
See https://bugzilla.wikimedia.org/show_bug.cgi?id=57430 for discussion on this in the context of mw-vagrant.
(anonymous) wrote:
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Ok, fair enough. Everyone's made some really good points so let's drop the idea of dropping our history.
However I think we should continue to discuss ways to contain the repo size going forward. That, combined with some aggressive repacking and dropping of refs/changes/* (when we move to Phabricator) should help get it under control.
[...]
Just to clarify: refs/changes/* = Gerrit patchsets (minus the ones referenced as submitted changes)? If so, sure, they're only scratchpads, but on the other hand they shouldn't affect the size of a default clone that just pulls in the parents of master's HEAD?
Tim
On Sat, May 31, 2014 at 12:30 PM, Tim Landscheidt tim@tim-landscheidt.de wrote:
(anonymous) wrote:
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Ok, fair enough. Everyone's made some really good points so let's drop the idea of dropping our history.
However I think we should continue to discuss ways to contain the repo size going forward. That, combined with some aggressive repacking and dropping of refs/changes/* (when we move to Phabricator) should help get it under control.
[...]
Just to clarify: refs/changes/* = Gerrit patchsets (minus the ones referenced as submitted changes)? If so, sure, they're only scratchpads, but on the other hand they shouldn't affect the size of a default clone that just pulls in the parents of master's HEAD?
Right. That's less a cloning problem than a problem on the remote, where it slows down operations on *that* repo.
-Chad
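The point about refs/changes/* can be checked on a toy setup. In the sketch below a local repository plays the Gerrit server, with one commit on the main branch and one unmerged commit parked under a made-up `refs/changes/01/1/1` ref; a default clone should then pick up neither the ref nor the object behind it.

```shell
#!/bin/sh
# Sketch: refs/changes/* exist on the server but are not part of a default clone.
# "server" and the ref name refs/changes/01/1/1 are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.org
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.org

git init -q server
cd server
echo base > file.txt
git add file.txt
git commit -q -m "merged change"

# An unmerged patchset: reachable only through a refs/changes/ ref.
git checkout -q -b patchset
echo "work in progress" >> file.txt
git commit -q -a -m "unmerged patchset"
patch=$(git rev-parse HEAD)
git update-ref refs/changes/01/1/1 "$patch"
git checkout -q -
git branch -q -D patchset
cd ..

# file:// forces the normal transport, so only requested objects are sent.
git clone -q "file://$work/server" dev

remote_changes=$(git ls-remote "file://$work/server" 'refs/changes/*' | wc -l | tr -d ' ')
local_changes=$(git -C dev for-each-ref refs/changes | wc -l | tr -d ' ')
if git -C dev cat-file -e "$patch" 2>/dev/null; then
    has_patchset=yes
else
    has_patchset=no
fi
echo "server refs/changes: $remote_changes, clone refs/changes: $local_changes, patchset object in clone: $has_patchset"
```

The refs still have to be advertised and stored by the server, which matches the observation that the cost falls on the remote rather than on people cloning.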
Hey,
One thing I have noticed is that it is much faster for me to clone core from GitHub than from WMF. I guess that having the thing also hosted in the EU would help.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
On 05/31/2014 12:38 PM, Chad wrote:
On Sat, May 31, 2014 at 5:52 AM, Bartosz Dziewoński matma.rex@gmail.com wrote:
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Ok, fair enough. Everyone's made some really good points so let's drop the idea of dropping our history.
However I think we should continue to discuss ways to contain the repo size going forward. That, combined with some aggressive repacking and dropping of refs/changes/* (when we move to Phabricator) should help get it under control.
Why don't we just suggest that people use shallow clones? Git supports pushing from and pulling to them since 1.9, and while Gerrit doesn't accept pushes from them (or at least it didn't when I just tried), I see no reason why Phabricator would have any issues if it only works on diffs anyway, not commits.
This is also a good idea.
-Chad
Chad, thanks for bringing this up. I'm especially grateful that you named the target repo size at 30 to 35 MB -- it makes this more concrete.
I think some more numbers/facts might help us make the right tradeoffs here to think about ways to contain the repo size going forward -- like:
* Antoine said "My repacked copy of core is 270MB" and we have some disagreement about whether that's okay. Is there a max size we want to avoid getting to?
* A bigger repo is obviously slower to download in full; is it also slower to search or otherwise work with? How much slower? Most of our developers are probably not on SSDs yet.
On Mon, 09 Jun 2014 17:55:42 +0200, Sumana Harihareswara sumanah@wikimedia.org wrote:
* A bigger repo is obviously slower to download in full; is it also slower to search or otherwise work with? How much slower? Most of our developers are probably not on SSDs yet.
It is, but only when you're working a lot with the history (such as blaming, using git log pickaxe or checking out really old versions), so removing the history obviously will not help in these cases. Creating commits, pushing or pulling is not any faster nor slower.
On Mon, Jun 9, 2014 at 11:55 AM, Sumana Harihareswara sumanah@wikimedia.org wrote:
* A bigger repo is obviously slower to download in full; is it also slower to search or otherwise work with? How much slower? Most of our developers are probably not on SSDs yet.
git is largely insensitive to repo size. It doesn't traverse the entire history unless it needs to. So for most developer tasks, the only constraint is disk size. (There might be a mild size dependency in 'git fetch' -- but I think that's more related to the number of branches in your tree and the tree structure, not simply to number of commits in the history.)
However, there are several steps in our current development *pipeline* that do things like naive 'git clone', whose speed depends linearly on the size of the repo. ("Naive" as it's a "simple matter of software" to use a cached local repo or pack to speed things up.) I believe currently both "git review" (submitting patches to gerrit) and the time it takes for jenkins to run tests have steps of this sort; "git review" on mediawiki/core is noticeably slower than on parsoid (for example).
Traded off against this -- if history is truncated, it will be much slower/more complicated for developers to do meaningful history searches, as has been mentioned. I'd expect that most hard-core developers would end up having to download the complete history anyway, as Bartosz suggests.
In summary, it seems to me that the reasonable forward path at the moment is some combination of (a) better documenting the use of shallow clones for newbie/infrequent contributors to reduce the initial developer roadblock (including verifying that this works with gerrit, etc), and (b) spending more effort optimizing the 'git clone' step in jenkins/gerrit (we already do some of this), and (c) paying attention to how phabricator uses git, to ensure that the repo size does not become an issue in the future. --scott
On 05/31/2014 08:52 AM, Bartosz Dziewoński wrote:
I don't like this idea, for the same reasons that others have already given. Grafting histories with git-replace might be viable, but it'd still be clunky and non-intuitive.
Why don't we just suggest that people use shallow clones? Git supports pushing from and pulling to them since 1.9, and while Gerrit doesn't accept pushes from them (or at least it didn't when I just tried), I see no reason why Phabricator would have any issues if it only works on diffs anyway, not commits.
Are you sure you can push from a shallow clone to a normal git repo? The 1.9 release notes (https://raw.githubusercontent.com/git/git/master/Documentation/RelNotes/1.9....) just say:
" * Fetching from a shallowly-cloned repository used to be forbidden, primarily because the codepaths involved were not carefully vetted and we did not bother supporting such usage. This release attempts to allow object transfer out of a shallowly-cloned repository in a more controlled way (i.e. the receiver becomes a shallow repository with a truncated history)."
Note the part about the receiver also being shallow.
I agree Phabricator should work fine. Heck, you can push to Phabricator via copy-and-paste (http://fab.wmflabs.org/differential/diff/create/), so there's no reason "push from shallow clone" can't be implemented.
Matt