Hey,
A month ago we had some discussion on our internal list on only having a single component per git repo. I now want to follow up on this. Since there is no reason to do this on the internal list, I'm now doing it here and including the original mail.
WikibaseQueryEngine and WikibaseQuery have been split off, and more excitingly, we finally managed to do this with the DataModel component as well. Now that the dust has settled from those changes, I think it is time to look at the next one: DataValues.
Right now this repo contains 6 components. Splitting this up thus means 5 new repos. We can do this one by one, though it probably makes sense to do the splits close together. I suggest starting with ValueView and DataTypes, as these are currently the most awkward dependency-wise.
Any objections to starting work on this next week (i.e. the one that starts tomorrow)? (Do read my original mail (below) first in case you have not already done so.)
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
---------- Forwarded message ---------- From: Jeroen De Dauw jeroendedauw@gmail.com Date: 9 June 2013 01:28 Subject: One component per repository To: Wikidata-intern wikidata-intern@wikimedia.de
Hey,
As you all know, right now we have git repositories containing multiple components. There are two main reasons for this:
* We started with different repos for client, lib and repo and then had a lot of hassle with moving code between them. This reason seems mainly historical at this point. This kind of code moving happens only very rarely now, at least in all components that are not client, lib or repo. It might still be going on to some extent in those three, though I suspect this has decreased since the start of the project, probably due to us more clearly defining the component boundaries. In fact, I think the primary reason we ended up moving so many things in these components was that those boundaries were so poorly defined.
* Installation hassle. Burdening regular users of the software with knowing about all the dependencies and the required versions, and with having to load them in a specific version in LocalSettings, is a bad idea. Ideally they should not have to know about the dependencies at all. So putting everything in a single repo is a reasonable idea if you lack a package management system. Luckily we are now very close to having Composer work for Wikibase, which enables us to have as many packages as we deem fit, without bothering the users with them. So that will not only avoid making the installation process more complex, it will reduce current complexity (since right now people do need to care about the dependencies (Diff, Ask, DataValues, ...)).
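To make the Composer point concrete, here is a minimal Python sketch of what a package manager does with declared dependencies: each component lists only its direct dependencies, and the full set to install is derived from that. This is not Composer's actual algorithm, and the dependency graph below is a made-up example rather than the real one for our components. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical direct dependencies, for illustration only.
DIRECT_DEPENDENCIES = {
    "Wikibase": ["Ask", "DataValues", "Diff"],
    "Ask": ["DataValues"],
    "DataValues": [],
    "Diff": [],
}


def resolve(root, graph):
    """Return root plus everything it needs, with dependencies listed first."""
    needed = {}
    stack = [root]
    while stack:
        component = stack.pop()
        if component not in needed:
            needed[component] = set(graph[component])
            stack.extend(graph[component])
    # TopologicalSorter takes a mapping of node -> predecessors (its dependencies).
    return list(TopologicalSorter(needed).static_order())


install_order = resolve("Wikibase", DIRECT_DEPENDENCIES)
print(install_order)
```

The user only names the top-level component; the indirect dependency (DataValues via Ask, in this made-up graph) is picked up without them having to know about it.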
In short, the main reasons for multiple components per repo no longer apply in our current situation. OTOH there are some clear disadvantages:
* People in the PHP world tend to assume there is only one component in a git repo. This is reflected in mainstream tools such as Composer and TravisCI having this assumption built in. It is not possible (AFAIK) to properly use Composer if you have all your packages in a single repo. For instance, it is currently not possible to list ValueParsers as a distinct component from DataValues in the package repositories. It is also not easily doable to have Travis do a build for each component; you'd have to add a two-tier build process, which is a lot of added complexity.
* Since the packages contain multiple components, it is less easy to keep track of changes to a single component, or to make use of just a single component without having to care about the other ones. Reusability and likelihood of outside contributions are decreased if your package contains a ton of stuff that is not applicable to the person who just wants to use one of the components in there.
* The notion that we cannot create new repositories for new components promotes harmful practices. These components will just be put somewhere based on what currently needs them, and possibly need to be moved closer to the root of the dependency tree. A good example of this is the ValueView code. If I'm not mistaken, some of this was originally in repo, then in lib (same package but different component), and then in the DataValues repo. The last step was made since we happened to be loading the DataValues repo everywhere we need this JavaScript code. ValueView certainly does not belong in the same repo however. Both components do very different things and are even written in different languages, so they will have different consumers, different contributors and different reasons for change.
* The dependency structure is less clear and explicit. Separate repos make messing up of dependencies easier to detect, and make the individual components easier to understand.
...
Given that, I suggest we stop the earlier pattern of putting multiple components in the same repo, and work on moving misplaced components into new repos. I will be doing this for the Database, Query and QueryEngine components in the coming days, as these are not needed by any functionality currently enabled anywhere. I suggest we do the same for the components in DataValues and for DataModel. We can probably leave lib, repo and client as they are, at least for now. Those are not really reusable in their current form anyway, and are the hardest ones to make improvements to. Plus we might want to tackle fixing issues with lib first. Putting DataModel in its own repo is currently blocked by remaining legacy dependencies, which need to be resolved first. That leaves the components in the DataValues repo as the thing to start with. I suggest we tackle that one as soon as everyone on the team has a clearer idea of how this will work with Composer and when there are no open commits on gerrit against the relevant component(s).
"Oh no, we will end up with a million git repos now!" Not really. We have 14 components right now, and this number has not increased a lot lately. The only recent additions are Database and QueryEngine. We might end up with a few more over time, but not hundreds, or even dozens.
"I still don't want to care about 14 different repos!" The situation for developers will not get worse (else I'd not be a fan of the idea to begin with), and will actually improve in places. For instance, when working on the QueryEngine component, you will now no longer need all this unrelated client, lib, repo, datatypes, valueparsers, valuevalidators, valueformatters and valueview code. If someone breaks the build for one of those, you can happily not care, since that code is not relevant to the work you are doing. You'll also have better CI support. And if you are not working on any of the components you depend on, for instance when working just on client, you can use Composer much like a regular user and not give a damn about all indirect dependencies.
</walloftext><other people raging>
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
Nothing changed on my side, do it as long as it is possible to keep the git history.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
On 07.07.2013 20:06, Jeroen De Dauw wrote:
Any objections against starting work on this next week (ie the one that starts tomorrow)? (Do read my original mail (below) first in case you have not yet already done so.)
I'd like to first discuss the implications for deployment for the foundation. Basically, one repo per component is better for testing and "managed" installation, but problematic for deployment and manual installation, as well as for development/code review.
A separate extension for each component means maintaining a lot of compatibility info somehow, somewhere. This is an issue for people installing by hand (yes, composer should help with that) and is a pain for development (there's a major refactoring of the formatter/parser stuff imminent).
I'd like to think about cost vs. benefit again. Why exactly should we do this? And what can we do to make it less of a pain?
Maybe having the components as submodules, instead of separate extensions, would help... Something to ask the Foundation.
-- daniel
On Mon, Jul 8, 2013 at 9:20 AM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 07.07.2013 20:06, Jeroen De Dauw wrote:
Any objections against starting work on this next week (ie the one that starts tomorrow)? (Do read my original mail (below) first in case you have not yet already done so.)
I'd like to first discuss the implications for deployment for the foundation. Basically, one repo per component is better for testing and "managed" installation, but problematic for deployment and manual installation, as well as for development/code review.
+1
For code review and development, it's more of a hassle to mark patches from another git repo as dependencies and to keep track of things.
A separate extension for each component means maintaining a lot of compatibility info somehow, somewhere. This is an issue for people installing by hand (yes, composer should help with that) and is a pain for development (there's a major refactoring of the formatter/parser stuff imminent).
I'd like to think about cost vs. benefit again. Why exactly should we do this? And what can we do to make it less of a pain?
I am *not* opposed to splitting things up, yet I am not sufficiently convinced that the benefits outweigh the hassles at this point.
Maybe having the components as submodules, instead of separate extensions, would help... Something to ask the Foundation.
For code review and development, that might help. For deployments, it could help, though it is not a magic solution.
And if we do agree to split the stuff up, please nobody self merge it! There are things that need to be done first to ensure the test systems do *not* break. Best to do those first.
Also these changes need to be sufficiently documented and announced widely.
Cheers, Katie
Hey,
And if we do agree to split the stuff up, please nobody self merge it! There are things that need to be done first to ensure the test systems do *not* break. Best to do those first.
Also these changes need to be sufficiently documented and announced widely.
Indeed. Let's avoid the confusion we had last time :)
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
Hey,
I'd like to think about cost vs. benefit again. Why exactly should we do this?
You are asking me to repeat the wall of text from the first mail? Please read it again and open discussion on specific points you disagree with.
A separate extension for each component means maintaining a lot of compatibility info somehow, somewhere.
In case of DataValues we already have one extension per component. The compatibility info is also quite manageable. In fact, it becomes a lot clearer what works together if things are properly kept separate and versioned. Right now I am getting questions from confused users using something based on DataValues and have to tell them to "get latest master of everything" or even things such as "any revision before somehash".
Maybe having the components as submodules, instead of separate extensions, would help... Something to ask the Foundation.
That is a question of how we make those components available to Wikibase. Exactly how we do this has not all that much effect on the sensibility of the split into multiple git repos. On this particular topic I have no strong opinions, though I am concerned about submodules, as this does not seem to work well when the repos pointed to by those submodules are needed by multiple components. You'll end up having them multiple times, no?
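A minimal Python sketch of the submodule concern, assuming each repo nests its own direct dependencies as submodules and that checkouts are done recursively. The dependency graph is hypothetical and only there to show the effect; it is not a claim about how any real setup is wired.

```python
from collections import Counter

# Hypothetical direct dependencies, for illustration only.
DEPENDENCIES = {
    "Wikibase": ["Ask", "DataValues", "Diff"],
    "Ask": ["DataValues"],
    "DataValues": [],
    "Diff": [],
}


def nested_checkouts(root, graph, prefix=()):
    """List every checkout path when each repo nests its dependencies as submodules."""
    paths = []
    for dependency in graph[root]:
        path = prefix + (dependency,)
        paths.append(path)
        paths.extend(nested_checkouts(dependency, graph, path))
    return paths


def flat_install(root, graph):
    """Collect each needed component once, the way a shared package directory would."""
    seen = set()
    stack = list(graph[root])
    while stack:
        component = stack.pop()
        if component not in seen:
            seen.add(component)
            stack.extend(graph[component])
    return seen


paths = nested_checkouts("Wikibase", DEPENDENCIES)
copies = Counter(path[-1] for path in paths)
print(copies)                                  # checkouts per component when nested
print(flat_install("Wikibase", DEPENDENCIES))  # each component once
```

In this made-up graph the shared component is checked out once directly and once more underneath the component that also needs it, while the flat install holds it only once.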
is a pain for development
I disagree this is a pain. Or perhaps it is, if you define a "pain" as the effects you have of using type hinting, of not just using globals, and properly injecting dependencies. All these things force explicitness of some sort, which you have to deal with. This explicitness is there to help you and prevent errors. If you try to ignore it, of course you will end up being frustrated. Or if you do not keep it in mind at all, you'll also end up frustrated. Since managing dependencies is one of the most important tasks in software development, you really ought to keep it in mind though.
(there's a major refactoring of the formatter/parser stuff imminent).
Those two components are separate. They do not even know about each other. And that is a very important property. So how does work on them affect a split in any way? I can see several advantages to having a split, such as it being more clear when changes are being made in one component, or being able to release one without being blocked by the other since it is in the middle of a refactor. What are the disadvantages in this case?
Now you can again bring up "oh no, we'll have to constantly make changes in multiple repos, and keeping track of this all will be hell". My answer to this also has not changed: if you split up distinct components and keep in mind all the relevant principles and trade-offs, then having to make changes across multiple repos should be very rare indeed. Almost all of the components that are not lib, repo and client have been created by me. I also did most of the work in these. And yet I did not run into significant hassle. If I had, I'd certainly not be advocating going further down this road.
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
On Mon, Jul 8, 2013 at 5:37 AM, Jeroen De Dauw jeroendedauw@gmail.com wrote:
In case of DataValues we already have one extension per component. The compatibility info is also quite manageable. In fact, it becomes a lot more clear what works together if things are properly kept separate and versioned. Right now I am getting questions from confused users using something based on DataValues and have to tell them to "get latest master of everything" or even things such as "any revision before somehash".
Hi Jeroen,
Can you go into more detail on this point? I'd like to understand a specific situation where this is causing problems for your users (actually, it'd be helpful for me to understand who the non-WMF users of DataValues are more generally).
Thanks Rob