Neither; the tools we have for running experiments are largely hand-built on an ad hoc basis. For data collection we have tools like EventLogging, although they require developer energy to integrate with [potential area of experimentation]. But when it comes to actually analysing the results, the picture is very different.
Let's use a couple of concrete examples: suppose we wanted to know whether including a contributor tagline, versus not including one, produced a statistically significant difference in whether or not people edited. Ideally we'd take the same set of pages and run a controlled study in the form of an A/B test.
So first we'd display one version of the site to 50% of the population and the other version to the other 50% (realistically we'd probably use smaller sets and give the vast majority of editors the default experience, but it's a hypothetical, so let's run with it). That would require developer energy. Then we'd set up some kind of logging to pipe back edit attempts and view attempts, broken down by [control sample/not control sample]. Also developer energy, although much less.
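A minimal sketch of what the bucketing half of that might look like, assuming we key off some stable user identifier (the function name, the 50/50 split and the identifier are all purely illustrative; the real assignment would have to live in MediaWiki-side code rather than a standalone script):

```python
import hashlib

def assign_bucket(user_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the user ID means the same person always sees the same
    variant without having to store the assignment anywhere.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits onto [0, 1) and compare to the split.
    position = int(digest[:8], 16) / 0x100000000
    return "treatment" if position < treatment_fraction else "control"

print(assign_bucket("Example_editor_12345"))  # hypothetical identifier
```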
Then, crucially, we'd have to actually do the analysis, which is not something that can be robustly generalised.
In this example we'd be looking for significance, so we'd be using some kind of statistical hypothesis test. Which test is appropriate depends on what probability distribution the underlying population follows, so we'd need to work out the most plausible distribution and then apply the test best suited to it. That's not something that can be automated through software: we get the data first, and only then work out how to test for significance.
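To make that concrete for this particular hypothetical, where the outcome is binary (edited or didn't edit), a chi-squared test of independence on the view/edit counts is one reasonable choice. The numbers below are invented, and a different metric, say edits per session, would follow a different distribution and call for a different test:

```python
from scipy.stats import chi2_contingency

# Invented counts standing in for what the logging above would return:
# rows are control / treatment, columns are edited / viewed without editing.
observed = [
    [120, 9880],   # control:   120 edits out of 10,000 views
    [150, 9850],   # treatment: 150 edits out of 10,000 views
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.4f}")
```

The choice of chi-squared here falls out of looking at the data, not out of anything a generic tool could decide for us, which is exactly the problem.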
The alternative would be something observational: you make the change and then compare people's behaviour while the change is live to their behaviour before and after. This cuts out most of the developer cost, but does nothing for the research support or the ad hoc code and tools that have to come with it.
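If we went that route, the analysis might end up looking something like the sketch below, with invented daily edit counts and a non-parametric test standing in for whatever the real data turned out to demand; it's exactly the sort of ad hoc code being described:

```python
from scipy.stats import mannwhitneyu

# Invented daily edit counts for the same pages, before the change and
# while it was live; real numbers would come from the same ad hoc logging.
edits_before = [42, 38, 51, 45, 39, 47, 44]
edits_during = [55, 49, 60, 52, 58, 50, 61]

# Daily edit counts are rarely normally distributed, so a non-parametric
# test is a safer default than a t-test for a before/during comparison.
stat, p_value = mannwhitneyu(edits_before, edits_during, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```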