I wonder how many points you get for nerd snipping a software engineer :)
Wonderful work Daniel, kudos! In the given list of bad dependencies, I find User class depending on Skin class one of the most problematic ones. Also, ApiQuery class depending on all of its subclasses is one of the biggest issue making all of API query modules act as giant big monolith. That can (hopefully) easily addressed and untangle significant portion of the 40%, a rather low hanging fruit.
Best
On Fri, May 10, 2019 at 6:39 PM Daniel Kinzler dkinzler@wikimedia.org wrote:
Hi all!
I invite you to try out "Project Ruprecht"[1][2], a tool that measures the "tangledness" of PHP code, and provides you with a "naughty list" of things to fix.
For now, you will have to install this locally. I hope however to soon have this run automatically against core on a regular basis, perhaps by integrating it with SonarQube. Maybe some day we can also integrate it with CI, to generate a warning when a new cyclic dependency is about to be introduced.
So that was the tl;dr. Now for some context, history, and shout-outs. And some actual real world science, too!
For a while now, I have been talking about the how much of a problem cyclic dependencies are in MediaWiki core: When two components (classes, namespaces, libraries, whatever) depend on each other, directly or indirectly, this means that one cannot be used without the other, nor can it be tested, understood, or modified without also considering the other. So, in effect, they behave as *one* component, not two. Applied to MW core, this means that roughly half of our 1600 classes effectively behave like a single giant class. This makes the code rather hard to deal with.
To fix this, I have been looking for tools that let me identify "tangles" of classes that depend on each other, and metrics' to measure the progress of "untangling" the code. However, the classic code quality metrics focus on "local" properties of the code, so they can't tell us much about the progress of untangling. And the tools I found that would detect cyclic dependencies in PHP code would all choke on MediaWiki core: they would try to list all detected cycles - which, by the super-exponential nature of possible paths through a graph, would be millions and millions. So, the tools would choke and die. That approach isn't practical for us.
Two discoveries allowed me to come up with a working solution: First, I decided to leave the PHP world and turned towards graph analysis tools built for large data sets. Python's graph-tool did the trick. It's build on top of boost and numpy, and it's *fast*. It crunched through the 7500 or so class dependencies in MW core in a split second, and told me that we have 14 "tangles" (non-trivial strongly connected components), and that 43% of our classes are in these tangles, with 40% being part of one big tangle that is essentially our monolith manifest. So now I had a metric to work with: the number of classes in tangles.
That was great, but still didn't tell me where to start. Graph-tool was still not fast enough to deal with millions of cycles, and even if it had been, that data wouldn't be very useful. I needed some smart heuristics. Luckily, I (totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one evening at the WMDE office by telling him about this problem. The next day, he told me that he had been digging into the problem all night, and he had found a paper that sounded relevant, and it also came with working code: "Breaking Cycles in Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S. Parthasarathy. I played with the code a bit, and yes! It spat out a list of 290 or so dependencies[4] that it thought were bad - and I agree for a good number of them. It's not a clean working list, but it gives a very good idea of where to start looking.
I find it quite fascinating that this works so well for cleaning up a codebase. After all, the heuristic wasn't design for this - it was designed for fixing messy ontologies. Indeed, one of their test data sets was (English language) Wikipedia's category system! I'd love to see what it does with Wikidata's subclass hierarchy :)
But I suppose it makes sense - dependencies in software are conceptually a lot like an ontology, and the same strategies of stratification and abstraction apply. And the same difficulties, too - it's easy enough to spot a problematic cycle, but often hard to say where it should be cut. And how to cut it - often, the solution is not to just remove the dependency, but to introduce a new abstraction that allows the relationship to exist without a cycle. I'd love to see the research continue in that direction!
So, a big shout out to the researchers, and to Amir who found the paper!
I hope my ramblings have made you curious to play with Ruprecht, and see what it has to say about other code bases. There's also another feature to play with which I haven't discussed here: detection of risky classes using the Page Rank algorithm. Fun!
Cheers, Daniel
[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/ [2]
https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-ana... [3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies [4] https://phabricator.wikimedia.org/P8513 [5] https://xkcd.com/356/
-- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation