Am 10.10.2013 18:40, schrieb Rob Lanphier:
Hi folks,
I think Daniel buried the lede here (see his mail below), so I'm mailing this out with a subject line that will hopefully provoke more discussion. :-)
Thanks for bumping this, Rob. And thanks to Tim for moderating this discussion so far, and to everyone for the criticism.
And sorry for my late reply, I'm just now catching up on email after a brief vacation and a few days of conferencing.
I'll reply to the various comments in this thread in this mail, to keep the discussion focused. I hope I managed to reply to all the relevant points.
First off, TLDR:
* I maintain that dependency injection is useful, and less painful than one might think.
* I'm open to discussion about whether to use a namespace ID or a canonical namespace name in the TitleValue object.
* I'm interested in discussing how to best slice and dice the parser and formatter services.
So, here are my replies to some comments:
I agree with this as well. The idea behind this RFC is the "hair can't cut itself" pattern. However, a value object needs to be easily serializable. So what representation is used for serializing a TitleValue? It can't be the display title or DB key since that's part of the TitleFormatter class.
TitleValue as proposed can be serialized without any problems using standard PHP serialization, or as a JSON structure containing the namespace ID and name string. Or we can come up with something nicer, like $nsid:$title or some such. The current Title object cannot be serialized at all directly.
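To illustrate, here is a minimal sketch of how JSON serialization could look. This is a hypothetical, stripped-down TitleValue (just a namespace ID and a DB-key string), not the interface proposed in the RFC:

```php
<?php
// Hypothetical minimal TitleValue: an integer namespace ID plus a
// title string in DB-key form. No config access, no global state.
class TitleValue {
    private $namespace; // integer namespace ID, e.g. 4 for NS_PROJECT
    private $dbkey;     // title text in DB-key form, e.g. "Village_pump"

    public function __construct( $namespace, $dbkey ) {
        $this->namespace = $namespace;
        $this->dbkey = $dbkey;
    }

    public function getNamespace() { return $this->namespace; }
    public function getDBkey() { return $this->dbkey; }

    // Serialize as a JSON structure containing the namespace ID
    // and the name string.
    public function toJson() {
        return json_encode( array( 'ns' => $this->namespace, 'title' => $this->dbkey ) );
    }

    public static function fromJson( $json ) {
        $data = json_decode( $json, true );
        return new self( $data['ns'], $data['title'] );
    }
}

$t = new TitleValue( 4, 'Village_pump' );
echo $t->toJson(); // {"ns":4,"title":"Village_pump"}
```

Since the object carries no references to services or configuration, plain PHP serialize() would work on it just as well.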
Maybe, but the RFC says, "As a value object, this has no knowledge about namespaces, permissions, etc.". I think there comes a point when you have to acknowledge that some properties of Title objects are indeed part of the value object.
The point is avoiding access to config information (which is global state). Namespace names are configuration and require global state for normalization/lookup. This should not be done in a value object.
I read it as: TitleValue doesn't know about textual namespaces like Category:, Project:, or User:, but just contains an integer namespace ID such as `4`.
I understand that using the numeric namespace ID is controversial. TitleValue could also require the canonical (!) namespace name to be provided to the constructor, instead of the numeric id. This would make it harder though to use it for DB queries.
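To show why the numeric ID is handy for DB queries, here is a hypothetical sketch (the function name is made up for illustration): with a numeric namespace ID, a title maps directly onto the page table's (page_namespace, page_title) columns, with no name lookup needed.

```php
<?php
// Hypothetical helper: build selection conditions for the page table
// straight from a numeric namespace ID and a DB-key string.
function titleToPageConds( $namespaceId, $dbkey ) {
    return array(
        'page_namespace' => $namespaceId, // e.g. 4 for NS_PROJECT
        'page_title' => $dbkey,           // e.g. "Village_pump"
    );
}

$conds = titleToPageConds( 4, 'Village_pump' );
// Would then be used along the lines of:
// $db->selectRow( 'page', '*', $conds, __METHOD__ );
print_r( $conds );
```

With a canonical namespace name instead, every query would first need a name-to-ID lookup, which again means access to configuration.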
$title = Title::newFromText( $text );
if ( $title ) {
    return $title->getLocalUrl( 'foo=bar' );
}
newFromText() uses global state to look up namespaces, interwiki prefixes, title normalization options, etc. If we want to get away from relying on global state, we have to avoid this.
getLocalUrl() uses global state to construct the URL using the wiki's base URL. Again, if we want to have reusable, testable code, this must be avoided.
Sure, global state is convenient. You don't have to think about what information is needed where, you don't have to be explicit about your dependencies. This makes it easier to write code. It makes it very hard to understand, test, reuse or refactor the code, though.
Yes, it's more terse and "easier to read". But it hides information flow and implicit dependencies, making the "easy read" quite misleading. It makes it harder to truly understand what is going on.
$tp = new TextTitleParser();
try {
    $title = $tp->parse( $text );
    $tf = new UrlTitleFormatter( $title, 'foo=bar' );
    return $tf->format();
} catch ( MWException $e ) {
    return null;
}
As Jeroen already pointed out in his reply, you should rarely have the need to turn a page title string into a URL. When generating the URL, you would typically already have a TitleValue object (if not, ask yourself why not).
Catching exceptions should be done as late as possible - ideally, in the presentation layer. Generally, throwing an exception is preferable to returning a "special" value, so the try/catch would not be here in the code.
However, you are right that being explicit about which information and services are needed where means writing more code. That's what "explicit" means. If we agree that it's a good thing to explicitly expose information flow and dependencies, then this implies that we need to actually write the additional (declarative) code.
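To make this concrete, here is a hypothetical sketch (not the actual proposed interface): instead of getLocalUrl() reading the wiki's base URL from global state, a formatter gets it injected, so the dependency is visible in the constructor signature.

```php
<?php
// Hypothetical URL formatter: all configuration is passed in
// explicitly, so there is no hidden dependency on global state.
class UrlTitleFormatter {
    private $baseUrl;

    public function __construct( $baseUrl ) {
        $this->baseUrl = $baseUrl;
    }

    public function format( $namespaceName, $dbkey, $query = '' ) {
        $page = $namespaceName === '' ? $dbkey : "$namespaceName:$dbkey";
        $url = $this->baseUrl . '/index.php?title=' . rawurlencode( $page );
        if ( $query !== '' ) {
            $url .= '&' . $query;
        }
        return $url;
    }
}

// No global state: a test can construct the formatter with any base URL.
$formatter = new UrlTitleFormatter( 'https://example.org/w' );
echo $formatter->format( 'Project', 'Village_pump', 'foo=bar' );
```

The extra constructor argument is exactly the "additional (declarative) code" meant above: the price of making the information flow explicit.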
Maybe my hair can't cut itself, but I can cut my own hair without having to find a Barber instance. ;)
But you will need to find a Scissors instance.
Why do we need a separate TitleParser and TitleFormatter? They use the same configuration data to do complementary operations. And then there's talk about making "TitleFormatter" a one-method class that has subclasses for every different way to format a title, and about having a subclass for wikilinks (huh?) which has subclasses for internal and external and interwiki links (wtf?), and so on.
I'm very much open to discussion regarding how the parser/renderer services are designed. For example:
* It would probably be fine to have a single class for parsing titles in wiki-link syntax, and generating wiki-link style titles.
* I'm reluctant to include functions for generating URLs / HTML links in the same class (this needs additional information / services).
* Beware that TitleValue objects represent *titles*, not links - they do not handle interwiki prefixes and such. An interwiki link would need to be represented by a different object (because of the prefix, but also because of namespace IDs vs. namespace names - there is room for discussion here).
High-level methods to actually do anything useful like "move a page" or "parse some wikitext" seem like they are going to need to take dozens of these objects to be able to actually do what they need to do.
I read this as: high level functions rely on multiple services (which is what makes them high level), which we then can control, configure and replace individually. This is a good thing.
(And yes, I know the argument is that then the useful function should be split into multiple smaller functions with fewer arguments, to which I reply "you just pushed the problem up the call stack, you didn't solve it").
No - I *distributed* it along the call stack in a sensible way.
So then the RFC proposes a "ServiceRegistry" to try to get around this problem of everything requiring dozens of arguments. But after observing that RequestContext already *is* this, it then argues against using RequestContext on the basis of additional dependencies.
The difference is that RequestContext is passed around the code a lot, while a registry should be passed around as little as possible. Access to such a registry should be limited to code that, for some reason, cannot have the services themselves injected.
A lot of code uses (read: depends on) RequestContext. Thus, all that code depends on anything RequestContext depends on. Adding more things to RequestContext means making a lot of code depend on more things, making it harder to test, understand and re-use.
"Kitchen-Sink" objects are bad, I'm trying to avoid making the kitchen sink bigger.
Why not have one ServiceRegistry, maybe even part of RequestContext, and use it preferentially for passing services instead of just as a fallback?
Because then everything depends on all the services.
That is, to use or test something that needs e.g. a UserPermissionService, you'd also have to provide all the other services, or you'd at least somehow have to know which services it needs, without this being declared explicitly anywhere.
Do we really have the need for multiple differently-configured services all over the place in "real" code that we need to be passing them around individually, or is it just for testing where we could as easily solve the problem by creating a registry instance that throws errors for all the services not needed for each test?
The point is modularity. Modularity makes code easier to re-use, easier to understand, and easier to refactor. Unit tests just expose dependency issues nicely, because they require modular isolation by definition.
But there are numerous examples in the MediaWiki code base where it is obvious that splitting things into services would have been beneficial, and would have avoided duplicate code and annoying workarounds.
The fact that Title needs the database and thus makes all tests need the database is just one glaring example. Another example is that the underlying logic of special pages is often the same as the underlying logic of API modules. But currently, the logic on special pages is mixed up with HTML generation, and the code in API modules is mixed with parameter processing and result generation, making re-use hard and leading to redundant code.
At the top level something *does* have to do all three things, even if it does it by calling into various "trees" and each of those trees has its own branches and each branch has its own twigs and so on.
Yes, but it then doesn't have to know *how* these things are done, and the logic for doing these things can easily be replaced.
The claimed problem behind a lot of this is "too many dependencies" making things hard to test and the idea that you can somehow make this go away by dividing everything into tinier and tinier pieces. To some extent this works, but at the cost of making the system as a whole harder to understand because you have to track all the little pieces.
You *don't* have to track the pieces, because they are all mentioned explicitly, and your editor (hopefully) allows you to navigate between them easily.
In my experience, what makes a system hard to understand are hidden dependencies and side effects (Title is full of those). Splitting things into more classes and smaller functions leads to code that is more readable.
It increases the number of files, yes. But files are really not the entity we should use to think about software architecture. If the number of files is an issue, something is wrong with the tools we use.
I doubt MediaWiki has reached the point of diminishing returns on that, but I'm not really sure that the end-goal envisioned here is the *right* division.
Well, how would *you* re-factor the Title class, then?
So in the end you either have to also test the thing as a whole anyway, or you end up with unit tests that are even less useful than what you started with, because they won't catch regressions they originally would have.
Integration tests are useful, but they are no replacement for unit tests (nor vice versa). One big difference is the number of code paths to cover (another is the effort of locating errors).
A heavily simplified example: We have three methods to test, where a() uses b() and b() uses c(). Let's say each of these has 5 code paths to test.
If we can test a(), b(), and c() separately (making a() use a mock version of b(), etc), we need 5+5+5 = 15 test cases for full coverage.
If we use an integration test, we'll need one test case for every possible combination of code paths - that's 5*5*5 = 125 test cases.
In practice, we'll not do this. We'll test, say, 20 typical cases, which means a lot of code paths are not covered and may well lead to problems.
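A tiny sketch of what "making a() use a mock version of b()" could look like. The functions here are made-up placeholders, not MediaWiki code; the dependency is injected as a callable so a test can swap it out:

```php
<?php
// a() receives its dependency on b() as an injected callable, so a
// unit test can replace b() with a stub and exercise a()'s own code
// paths in isolation.
function a( $input, callable $b ) {
    if ( $input < 0 ) {
        return 'negative';
    }
    return 'b said: ' . $b( $input );
}

function b( $x ) {
    return $x * 2; // real implementation, possibly calling c() etc.
}

// Production code: inject the real b().
echo a( 3, 'b' ), "\n"; // b said: 6

// Unit test: inject a stub; only a()'s logic is under test.
$stub = function ( $x ) { return 'stub'; };
assert( a( 3, $stub ) === 'b said: stub' );
assert( a( -1, $stub ) === 'negative' );
```

The same pattern with constructor-injected services is what keeps the 15-vs-125 arithmetic above on the cheap side.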
Would it make sense to cut a TitleValue branch as a playground? If we can get some code before the January summit, that would give us all a bit more experience.
My experience from introducing the ContentHandler makes me very wary of using a development branch. Rebasing that over and over is going to be very painful.
-- daniel