Hi all!
I invite you to try out "Project Ruprecht"[1][2], a tool that measures the "tangledness" of PHP code, and provides you with a "naughty list" of things to fix.
For now, you will have to install this locally. I hope however to soon have this run automatically against core on a regular basis, perhaps by integrating it with SonarQube. Maybe some day we can also integrate it with CI, to generate a warning when a new cyclic dependency is about to be introduced.
So that was the tl;dr. Now for some context, history, and shout-outs. And some actual real world science, too!
For a while now, I have been talking about the how much of a problem cyclic dependencies are in MediaWiki core: When two components (classes, namespaces, libraries, whatever) depend on each other, directly or indirectly, this means that one cannot be used without the other, nor can it be tested, understood, or modified without also considering the other. So, in effect, they behave as *one* component, not two. Applied to MW core, this means that roughly half of our 1600 classes effectively behave like a single giant class. This makes the code rather hard to deal with.
To fix this, I have been looking for tools that let me identify "tangles" of classes that depend on each other, and metrics' to measure the progress of "untangling" the code. However, the classic code quality metrics focus on "local" properties of the code, so they can't tell us much about the progress of untangling. And the tools I found that would detect cyclic dependencies in PHP code would all choke on MediaWiki core: they would try to list all detected cycles - which, by the super-exponential nature of possible paths through a graph, would be millions and millions. So, the tools would choke and die. That approach isn't practical for us.
Two discoveries allowed me to come up with a working solution: First, I decided to leave the PHP world and turned towards graph analysis tools built for large data sets. Python's graph-tool did the trick. It's build on top of boost and numpy, and it's *fast*. It crunched through the 7500 or so class dependencies in MW core in a split second, and told me that we have 14 "tangles" (non-trivial strongly connected components), and that 43% of our classes are in these tangles, with 40% being part of one big tangle that is essentially our monolith manifest. So now I had a metric to work with: the number of classes in tangles.
That was great, but still didn't tell me where to start. Graph-tool was still not fast enough to deal with millions of cycles, and even if it had been, that data wouldn't be very useful. I needed some smart heuristics. Luckily, I (totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one evening at the WMDE office by telling him about this problem. The next day, he told me that he had been digging into the problem all night, and he had found a paper that sounded relevant, and it also came with working code: "Breaking Cycles in Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S. Parthasarathy. I played with the code a bit, and yes! It spat out a list of 290 or so dependencies[4] that it thought were bad - and I agree for a good number of them. It's not a clean working list, but it gives a very good idea of where to start looking.
I find it quite fascinating that this works so well for cleaning up a codebase. After all, the heuristic wasn't design for this - it was designed for fixing messy ontologies. Indeed, one of their test data sets was (English language) Wikipedia's category system! I'd love to see what it does with Wikidata's subclass hierarchy :)
But I suppose it makes sense - dependencies in software are conceptually a lot like an ontology, and the same strategies of stratification and abstraction apply. And the same difficulties, too - it's easy enough to spot a problematic cycle, but often hard to say where it should be cut. And how to cut it - often, the solution is not to just remove the dependency, but to introduce a new abstraction that allows the relationship to exist without a cycle. I'd love to see the research continue in that direction!
So, a big shout out to the researchers, and to Amir who found the paper!
I hope my ramblings have made you curious to play with Ruprecht, and see what it has to say about other code bases. There's also another feature to play with which I haven't discussed here: detection of risky classes using the Page Rank algorithm. Fun!
Cheers, Daniel
[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/ [2] https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-ana... [3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies [4] https://phabricator.wikimedia.org/P8513 [5] https://xkcd.com/356/
I wonder how many points you get for nerd snipping a software engineer :)
Wonderful work Daniel, kudos! In the given list of bad dependencies, I find User class depending on Skin class one of the most problematic ones. Also, ApiQuery class depending on all of its subclasses is one of the biggest issue making all of API query modules act as giant big monolith. That can (hopefully) easily addressed and untangle significant portion of the 40%, a rather low hanging fruit.
Best
On Fri, May 10, 2019 at 6:39 PM Daniel Kinzler dkinzler@wikimedia.org wrote:
Hi all!
I invite you to try out "Project Ruprecht"[1][2], a tool that measures the "tangledness" of PHP code, and provides you with a "naughty list" of things to fix.
For now, you will have to install this locally. I hope however to soon have this run automatically against core on a regular basis, perhaps by integrating it with SonarQube. Maybe some day we can also integrate it with CI, to generate a warning when a new cyclic dependency is about to be introduced.
So that was the tl;dr. Now for some context, history, and shout-outs. And some actual real world science, too!
For a while now, I have been talking about the how much of a problem cyclic dependencies are in MediaWiki core: When two components (classes, namespaces, libraries, whatever) depend on each other, directly or indirectly, this means that one cannot be used without the other, nor can it be tested, understood, or modified without also considering the other. So, in effect, they behave as *one* component, not two. Applied to MW core, this means that roughly half of our 1600 classes effectively behave like a single giant class. This makes the code rather hard to deal with.
To fix this, I have been looking for tools that let me identify "tangles" of classes that depend on each other, and metrics' to measure the progress of "untangling" the code. However, the classic code quality metrics focus on "local" properties of the code, so they can't tell us much about the progress of untangling. And the tools I found that would detect cyclic dependencies in PHP code would all choke on MediaWiki core: they would try to list all detected cycles - which, by the super-exponential nature of possible paths through a graph, would be millions and millions. So, the tools would choke and die. That approach isn't practical for us.
Two discoveries allowed me to come up with a working solution: First, I decided to leave the PHP world and turned towards graph analysis tools built for large data sets. Python's graph-tool did the trick. It's build on top of boost and numpy, and it's *fast*. It crunched through the 7500 or so class dependencies in MW core in a split second, and told me that we have 14 "tangles" (non-trivial strongly connected components), and that 43% of our classes are in these tangles, with 40% being part of one big tangle that is essentially our monolith manifest. So now I had a metric to work with: the number of classes in tangles.
That was great, but still didn't tell me where to start. Graph-tool was still not fast enough to deal with millions of cycles, and even if it had been, that data wouldn't be very useful. I needed some smart heuristics. Luckily, I (totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one evening at the WMDE office by telling him about this problem. The next day, he told me that he had been digging into the problem all night, and he had found a paper that sounded relevant, and it also came with working code: "Breaking Cycles in Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S. Parthasarathy. I played with the code a bit, and yes! It spat out a list of 290 or so dependencies[4] that it thought were bad - and I agree for a good number of them. It's not a clean working list, but it gives a very good idea of where to start looking.
I find it quite fascinating that this works so well for cleaning up a codebase. After all, the heuristic wasn't design for this - it was designed for fixing messy ontologies. Indeed, one of their test data sets was (English language) Wikipedia's category system! I'd love to see what it does with Wikidata's subclass hierarchy :)
But I suppose it makes sense - dependencies in software are conceptually a lot like an ontology, and the same strategies of stratification and abstraction apply. And the same difficulties, too - it's easy enough to spot a problematic cycle, but often hard to say where it should be cut. And how to cut it - often, the solution is not to just remove the dependency, but to introduce a new abstraction that allows the relationship to exist without a cycle. I'd love to see the research continue in that direction!
So, a big shout out to the researchers, and to Amir who found the paper!
I hope my ramblings have made you curious to play with Ruprecht, and see what it has to say about other code bases. There's also another feature to play with which I haven't discussed here: detection of risky classes using the Page Rank algorithm. Fun!
Cheers, Daniel
[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/ [2]
https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-ana... [3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies [4] https://phabricator.wikimedia.org/P8513 [5] https://xkcd.com/356/
-- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation
Am 13.05.19 um 12:48 schrieb Amir Sarabadani:
I wonder how many points you get for nerd snipping a software engineer :)
One, as it's rather easy ;)
Wonderful work Daniel, kudos!
Thank you for your help!
In the given list of bad dependencies, I find User class depending on Skin class one of the most problematic ones.
It's nasty on the level of the dependency graph, but it's a trivial dependency, and easily resolved using a trait or utility method: Skin::normalizeKey( $wgDefaultSkin );
Want to give it a go?
Thinking this one step further, all logic related to user options should be factored out of the User class anyway. Filed as https://phabricator.wikimedia.org/T223099Y
Also, ApiQuery class depending on all of its subclasses is one of the biggest issue making all of API query modules act as giant big monolith. That can (hopefully) easily addressed and untangle significant portion of the 40%, a rather low hanging fruit.
Yea... these dependencies a<re however not "real" - they are all dependencies on a class name literal (ApiQueryFooBar::class), not the class itself. I have filed this as a feature request with upstream https://github.com/mihaeu/dephpend/issues/47.
Cheers, Daniel
On 10/05/2019 18:39, Daniel Kinzler wrote:
Hi all!
I invite you to try out "Project Ruprecht"[1][2], a tool that measures the "tangledness" of PHP code, and provides you with a "naughty list" of things to fix.
<snip>
[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/ [2] https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-ana...
Hello,
A few months ago, the code health group enquired about the Php metrics static analysis tool [1] and it has been rather straightforward to run it automatically and publish its report:
https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/
The report has a list of potential coupling and also show relation between objects. That might be a complement.
As for your tool, I am pretty sure we can easily run it automatically and publish it next to the PHP Metrics report? The addition to CI has been rather straightforward:
A container: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/
The Jenkins job: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/
[1] https://www.phpmetrics.org/
cheers,
Hey Antoine!
Am 13.05.19 um 20:59 schrieb Antoine Musso:
Hello,
A few months ago, the code health group enquired about the Php metrics static analysis tool [1] and it has been rather straightforward to run it automatically and publish its report:
https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/
The report has a list of potential coupling and also show relation between objects. That might be a complement.
Yea, that's what I used for my initial analysis for the session on decoupling at TechConf in Portland :)
phpmetrics and similar tools are useful, but they do not analyze the transitive dependencies (at least not sufficiently, for my use case). That is, they can't tell me whether two classes are coupled, but they can't tell me which classes to decouple to resolve clusters of tightly coupled code. And they can't measure how "tangled" the codebase is overall, just how "good" or "bad" a given class is (or all classes are, on average).
As for your tool, I am pretty sure we can easily run it automatically and publish it next to the PHP Metrics report? The addition to CI has been rather straightforward:
A container: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/
The Jenkins job: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/
Ruprecht comes with a few dependencies that may make containerization less straight forward. Not terribly hard, but somewhat annoying. E.g. it needs python 1 *and* 3, it needs graph-tool which has to be installed as a debian package from a non-standard repo, etc.
If you feel like looking into that, I'd of course be happy, of course ;)
Guten Tag,
Time flow, and I forgot to reply to that email :-\
On 13/05/2019 23:11, Daniel Kinzler wrote:
Hey Antoine!
Am 13.05.19 um 20:59 schrieb Antoine Musso:
Hello,
A few months ago, the code health group enquired about the Php metrics static analysis tool [1] and it has been rather straightforward to run it automatically and publish its report:
https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/
The report has a list of potential coupling and also show relation between objects. That might be a complement.
Yea, that's what I used for my initial analysis for the session on decoupling at TechConf in Portland :)
phpmetrics and similar tools are useful, but they do not analyze the transitive dependencies (at least not sufficiently, for my use case). That is, they can't tell me whether two classes are coupled, but they can't tell me which classes to decouple to resolve clusters of tightly coupled code. And they can't measure how "tangled" the codebase is overall, just how "good" or "bad" a given class is (or all classes are, on average).
Oh so yeah that is going way deeper in the graph analysis. I would guess someone with a background in graph theory would be able to help on that front. I love the topic, but not enough to actually learn it :-\
As for your tool, I am pretty sure we can easily run it automatically and publish it next to the PHP Metrics report? The addition to CI has been rather straightforward:
A container: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/
The Jenkins job: https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/
Ruprecht comes with a few dependencies that may make containerization less straight forward. Not terribly hard, but somewhat annoying. E.g. it needs python 1 *and* 3, it needs graph-tool which has to be installed as a debian package from a non-standard repo, etc.
If you feel like looking into that, I'd of course be happy, of course ;)
There is no python 1? I am assuming you are referring to python2.7. They can be coinstalled in the same Debian based container, they simply are different binaries: /usr/bin/python2 /usr/bin/python3. Should be easy.
Depends on what a non standard repo is? The CI images are build with the non-standard repository apt.wikimedia.org, we have some PHP packages from sury.org (the person who also package PHP for Debian) and others. It is super easy, has long as the packages are signed with a gpg key.
If you could please fill a task against #continuous-integration-config in Phabricator with a rough breakdown of how to setup the stack. I am sure it can be done rather quickly.
wikitech-l@lists.wikimedia.org