Stephen Freeman

Test-Driven

Some mocks

“I don’t use a knife when I eat, if I need one it means the food hasn’t been cut up enough.”
“Forks are unnecessary, I can do everything I want with a pointed knife.” [1]

One of the things we realised when writing “Growing Object-Oriented Software” is that arguments about mocks in principle are often meaningless. Some of us think about objects in terms of Alan Kay’s emphasis on message passing; others don’t. In my world, I’m interested in the protocols of how objects communicate, not what’s inside them, so testing based on interactions is a natural fit. If that’s not the kind of structure you’re working with right now, then testing with mocks is probably not the right technique.

This post was triggered by Arlo Belshee’s post “The No Mocks Book”. I think he has a really good point, buried in some weaker arguments (the developers I work with don’t use mocks just to minimise design churn; they’re as ruthless as anyone when it comes to fixing structure). His valuable point is that it’s a really good idea to try setting yourself style constraints to re-examine old habits, as in this “object-oriented calisthenics exercise”. As I once wrote, Test-Driven Development itself can have this property of forcing a developer to stop and think about what’s really needed, rather than pushing ahead with an implementation.

As for Arlo’s example, I don’t think I can provide a “better” solution without knowing more detail. As he points out in the comments, this is legacy code, so it’s harder to use mocks for interface discovery. I think Arlo’s partner is right that ProjectFile.LoadFrom is problematic. For me, the confusion probably comes from combining the reading of bytes from disk with the construction of a domain structure; I’d expect better structure if I separated them. In practice, what I’d probably do is some “annealing”: inlining code and looking for a better partition. Finally, it would be great if Arlo could finish up the reworking; I can believe that he has a much better solution in mind, but I’m struggling with the value of this step.

There is one more thing we agree on: the idea of building a top-level language that makes sense in the domain. Calling methods on the same object, like a Smalltalk cascade, is one way, although there’s nothing in the type to reveal the protocol—how the calls relate to each other. We did this in jMock 1, where we used interface chaining to guide the programmer and the IDE as to what to do next. Arlo’s example is simple enough that the top level can be inspected to see if it makes sense. I’ve worked in other domains where things are complicated enough that we really did need some checked examples to make us feel confident that we’d got it right.
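
For illustration, here is roughly what interface chaining looks like; the interface and method names below are a hypothetical sketch, not jMock’s actual API. Each call returns a narrower interface that offers only the legal next steps, so the compiler and the IDE’s auto-completion walk the programmer through the protocol:

// A sketch of interface chaining in the spirit of jMock 1 (hypothetical
// names, not the real jMock interfaces). Each method returns an interface
// exposing only the legal next steps in the protocol.
interface ExpectationBuilder {
  MethodClause method(String name);        // which call do we expect?
}

interface MethodClause {
  ActionClause with(Object... arguments);  // with which arguments?
}

interface ActionClause {
  void will(Action action);                // and how should it respond?
}

interface Action { }                       // e.g. return a stubbed value

// A chained expectation then reads as a sentence, and the types stop
// you writing the clauses out of order:
//   expectation.method("withdraw").with(amount).will(returnValue(receipt));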

1) of course, most of the world eats with their hands or chopsticks, so this is a culturally oppressive metaphor

On the composeability of Hamcrest matchers

A recent discussion on the hamcrest java users list got me thinking that I should write up a little style guide, in particular about how to create custom Hamcrest matchers.

Reporting negative scenarios

The issue, as raised by “rdekleijn”, was that he wasn’t getting useful error messages when testing a negative scenario. The original version looked something like this, including a custom matcher:

public class OnComposabilityExampleTest {
  @Test public void
  wasNotAcceptedByThisCall() {
    assertThat(theObjectReturnedByTheCall(),
               not(hasReturnCode(HTTP_ACCEPTED)));
  }

  private Matcher<ThingWithReturnCode>
  hasReturnCode(final int expectedReturnCode) {
    return new TypeSafeDiagnosingMatcher<ThingWithReturnCode>() {
      @Override protected boolean
      matchesSafely(ThingWithReturnCode actual, Description mismatch) {
        final int returnCode = actual.getReturnCode();
        if (expectedReturnCode != returnCode) {
          mismatch.appendText("return code was ")
                  .appendValue(returnCode);
          return false;
        }
        return true;
      }

      @Override
      public void describeTo(Description description) {
        description.appendText("a ThingWithReturnCode equal to ")
                   .appendValue(expectedReturnCode);
      }
    };
  }
}

which produces an unhelpful error because the received object doesn’t have a readable toString() method.

java.lang.AssertionError:
Expected: not a ThingWithReturnCode equal to <202>
     but: was <ThingWithReturnCode@...>

The problem is that the not() matcher only knows that the matcher it wraps has accepted the value. It can’t ask for a mismatch description from the inner matcher because, at that level, the value has actually matched. This is probably a design flaw in Hamcrest (an early version had a way to extract a printable representation of the thing being checked), but we can use the moment to think about improving the design of the test, working with the grain of Hamcrest, which is designed to be composeable.
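
To see the mechanism, here is a simplified sketch of what a negating matcher has to look like (a paraphrase to show the shape of the problem, not Hamcrest’s actual source):

import org.hamcrest.BaseMatcher;
import org.hamcrest.Description;
import org.hamcrest.Matcher;

// A paraphrase of a negating matcher. When it fails, the wrapped
// matcher has *matched*, so there is no inner mismatch to report.
class Not<T> extends BaseMatcher<T> {
  private final Matcher<T> inner;

  Not(Matcher<T> inner) { this.inner = inner; }

  @Override public boolean matches(Object actual) {
    return !inner.matches(actual);   // we fail exactly when inner succeeds
  }

  @Override public void describeTo(Description description) {
    description.appendText("not ").appendDescriptionOf(inner);
  }

  // There is no useful describeMismatch to write here: asking `inner`
  // for a mismatch description makes no sense, because `inner` accepted
  // the value. The default behaviour falls back to printing the actual
  // value, which is why we see the object's raw toString().
}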

Separating concerns

The first thing to notice is that the custom matcher is doing too much: it’s both extracting the value and checking that it matches. A better design would split the two activities and delegate the decision about the validity of the return code to an inner matcher.

public class OnComposabilityExampleTest {
  @Test public void
  wasNotAcceptedByThisCall() {
    assertThat(theObjectReturnedByTheCall(),
               hasReturnCode(not(equalTo(HTTP_ACCEPTED))));
  }

  private Matcher<ThingWithReturnCode>
  hasReturnCode(final Matcher<Integer> codeMatcher) {
    return new TypeSafeDiagnosingMatcher<ThingWithReturnCode>() {
      @Override protected boolean
      matchesSafely(ThingWithReturnCode actual, Description mismatch) {
        final int returnCode = actual.getReturnCode();
        if (!codeMatcher.matches(returnCode)) {
          mismatch.appendText(" return code ");
          codeMatcher.describeMismatch(returnCode, mismatch);
          return false;
        }
        return true;
      }

      @Override
      public void describeTo(Description description) {
        description.appendText("a ThingWithReturnCode with code ")
                   .appendDescriptionOf(codeMatcher);
      }
    };
  }
}

which gives the much clearer error:

java.lang.AssertionError:
Expected: a ThingWithReturnCode with code not <202>
     but: return code was <202>

Now the assertion line in the test reads better, and we have the flexibility to make assertions such as hasReturnCode(greaterThan(25)) without changing our custom matcher.

Built-in support

This is such a common situation that we’ve included some infrastructure in the Hamcrest libraries to make it easier. There’s a template FeatureMatcher, which extracts a “feature” from an object and passes it to a matcher. In this case, it would look like:

private Matcher<ThingWithReturnCode>
hasReturnCode(final Matcher<Integer> codeMatcher) {
  return new FeatureMatcher<ThingWithReturnCode, Integer>(
          codeMatcher, "ThingWithReturnCode with code", "code") {
    @Override
    protected Integer featureValueOf(ThingWithReturnCode actual) {
      return actual.getReturnCode();
    }
  };
}

and produces an error:

java.lang.AssertionError:
Expected: ThingWithReturnCode with code not <202>
     but: code was <202>

The FeatureMatcher handles the checking of the extracted value and the reporting.

Finally, in this case, getReturnCode() conforms to Java’s bean format so, if you don’t mind that the property name is a string that isn’t statically checked, the simplest thing would be to avoid writing a custom matcher altogether and use the built-in hasProperty() matcher instead.

public class OnComposabilityExampleTest {
  @Test public void
  wasNotAcceptedByThisCall() {
    assertThat(theObjectReturnedByTheCall(),
               hasProperty("returnCode", not(equalTo(HTTP_ACCEPTED))));
  }
}

which gives the error:

java.lang.AssertionError:
Expected: hasProperty("returnCode", not <202>)
     but: property 'returnCode' was <202>

Another reason not to log directly in your code

I’ve been ranting for some time that it’s a bad idea to mix logging directly into production code. The right thing to do is to introduce a collaborator whose responsibility is to provide structured notifications to the outside world about what’s happening inside an object. I won’t go through the whole discussion here but, somehow, I don’t think I’m winning this one.

Recently, a team I know provided another reason to avoid mixing production logging with code. They have a system that processes messages and have been asked to record all the accepted messages for later reconciliation with an upstream system. They did what most Java teams would do and logged incoming messages in the class that processes them. Then they associated a special appender with that class’s logger that writes its entries to a file somewhere for later checking. The appenders are configured in a separate XML file.
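
The shape of that configuration matters here; it would look something like this (hypothetical names, in roughly log4j 1.x style). Note that the logger is referenced by the processing class’s fully-qualified name, held as a string:

<!-- A sketch of the configuration (hypothetical names). The logger is
     looked up by class name, so renaming the class silently breaks it. -->
<appender name="RECONCILIATION" class="org.apache.log4j.FileAppender">
  <param name="File" value="/var/log/app/accepted-messages.log"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%d %m%n"/>
  </layout>
</appender>

<logger name="com.example.MessageProcessor">
  <appender-ref ref="RECONCILIATION"/>
</logger>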

One day the inevitable happened and they renamed the message processing class during a refactoring. This broke the reference in the XML configuration and the logging stopped. It wasn’t caught for a little while because there wasn’t a test. So, lesson one is that, if it matters, there should have been a test for it. But this is a pretty rigorous team that puts a lot of effort into doing things right (I’ve seen much worse), so how did they miss it?

I think part of it is the effort required to test logging. A unit test won’t do because the structure includes configuration, and acceptance tests run slowly because loggers buffer to improve performance. And part of it is to do with using a side effect of system infrastructure to implement a service. There’s nothing in the language of the implementation code that describes the meaning of reporting received messages: “it’s just logging”.

Once again, if I want my code to do something, I should just say so…

Update: I’ve had several responses, here and on other media, about how teams might avoid this particular failure. All of them are valid, and I know there are techniques for achieving what I need while still using a logging framework.

I was trying to make a different point—that some coding techniques seem to lead me in better directions than others, and that a logging framework isn’t one of them. Once again I find that the trickiness of testing an example like this is a clue that I should be looking at my design again. If I introduce a collaborator to receive structured notifications, I can separate the concepts of handling messages and reporting progress. Once I’ve split out the code that supports the reconciliation messages, I can test and administer it separately, with a clear relationship between the two functions.
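
To make that concrete, here is a minimal sketch of the shape I have in mind (all the names here are hypothetical, not the team’s actual code). The processor announces accepted messages to a collaborator, and the reconciliation recorder becomes just one implementation of that role:

// A minimal sketch of structured notifications (hypothetical names).
// Reporting accepted messages becomes a first-class responsibility,
// not a side effect of the logging infrastructure.
interface Message { }

interface ProcessingNotifications {
  void messageAccepted(Message message);
  void messageRejected(Message message, String reason);
}

class MessageProcessor {
  private final ProcessingNotifications notifications;

  MessageProcessor(ProcessingNotifications notifications) {
    this.notifications = notifications;
  }

  public void process(Message message) {
    // ... handle the message ...
    notifications.messageAccepted(message);
  }
}

A unit test can then simply assert that messageAccepted() was called, and the implementation that writes the reconciliation file can be tested and wired up explicitly, so that a rename can’t silently disconnect it.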

None of this guarantees a perfect design, but I find I do better if I let the code do the work.

Test-First Development 1968

Seeing Kevlin Henney again at the Goto conference reminded me of a quotation he cited at Agile on the Beach last month.

In 1968, NATO funded a conference with the then provocative title of Software Engineering. Many people feel that this is the moment when software development lost its way, but the report itself is more lively than its title suggests.

It turns out that “outside in” development, with early testing, is older than we thought. Here’s a quote by Alan Perlis from the report:

I’d like to read three sentences to close this issue.

  1. A software system can best be designed if the testing is interlaced with the designing instead of being used after the design.
  2. A simulation which matches the requirements contains the control which organizes the design of the system.
  3. Through successive repetitions of this process of interlaced testing and design the model ultimately becomes the software system itself. I think that it is the key of the approach that has been suggested, that there is no such question as testing things after the fact with simulation models, but that in effect the testing and the replacement of simulations with modules that are deeper and more detailed goes on with the simulation model controlling, as it were, the place and order in which these things are done.

It’s all out there in our history; we just have to be able to find it.

Going to Goto (twice)

I’ll be at Goto Aarhus, October 9-14 this year, giving a presentation and a workshop on Nat Pryce’s and my material on using Test-Driven Development at multiple levels, guiding the design of system components as well as the objects within them.

If you register with the code free1250, you’ll get a discount of 1250 DKK, and Goto will donate the same amount to Computers for Charities.

Some of us are then rushing to Goto Amsterdam, where I’ll be giving the talk again on Friday. Again the code free1250 will do something wonderful, but I’m not quite sure what.

Test-Driven Development and Embracing Failure

At the last London XpDay, some teams talked about their “post-XP” approach. In particular, they don’t do much Test-Driven Development because they find it’s not worth the effort. I visited one of them, Forward, and saw how they’d partitioned their system into composable actors, each of which was small enough to fit into a couple of screens of Ruby. They release new code to a single server in their farm, watching the traffic statistics that result. If it’s successful, they carefully propagate it out to the rest of the farm. If not, they pull it and try something else. In their world, the improvement in traffic statistics, the end benefit of the feature, is what they look for, not the implemented functionality.

I think this fits into Dave Snowden’s Cynefin framework, where he distinguishes between the ordered and unordered domains. In the ordered domain, causes lead to effects. This might be difficult to see and require an expert to interpret, but essentially we expect to see the same results when we repeat an action. In the complex, unordered domain, there is no such promise. For example, we know that flocking birds are driven by three simple rules but we can’t predict exactly where a flock will go next. Groups of people are even more complex, as conscious individuals can change the structure of a system whilst being part of it. We need different techniques for working with ordered and unordered systems, as anyone who’s tried to impose order on a gang of unruly programmers will know.

Loosely, we use rules and expert knowledge for ordered systems; the appropriate actions can be decided from outside the system. Much of the software we’re commissioned to build is about lowering the cost of expertise by encoding human decision-making. This works for, say, ticket processing, but is problematic for complex domains where the result of an action is literally unknowable. There, the best we can do to influence a system is to try probing it and be prepared to respond quickly to whatever happens. Joseph Pelrine uses the example of a house party—a good host knows when to introduce people, when to top up the drinks, and when to rescue someone from that awful bore from IT. A party where everyone is instructed to re-enact all the moves from last time is unlikely to be equally successful [1]. Online start-ups are another example of operating in a complex environment: the Internet. Nobody really knows what all those people will do, so the best option is to act, to ship something, and then respond as the behaviour becomes clearer.

Snowden distinguishes between “fail-safe” and “safe-fail” initiatives. We use fail-safe techniques for ordered systems because we know what’s supposed to happen and it’s more effective to get things right—we want a build system that just works. We use safe-fail techniques for unordered systems because the best we can do is to try different actions, none of which is large enough to damage the system, until we find something that takes us in the right direction—with a room full of excitable children, we might try playing a video to see if it calms them down.

At the technical level, Test-Driven Development is largely fail-safe. It allows us, amongst other benefits, to develop code that just works (for multiple meanings of “work”). We take a little extra time around the writing of the code, which more than pays back within the larger development cycle. At higher levels, TDD can support safe-fail development because it lowers the cost of changing our mind later. This allows us to take an interim decision now about which small feature to implement next or which design to choose. We can afford to revisit it later when we’ve seen the result without crashing the whole project.

Continuous deployment environments such as Forward’s [2], on the other hand, emphasise “safe-fail”. The system is partitioned so that no individual change can damage it, and the feedback loop is tight enough that the team can detect and respond to changes very quickly. That said, even the niftiest lean start-up will have fail-safe elements too; a sustained network failure or a data breach could be the end of the company. Start-ups that fail to understand this end up teetering on the edge of disaster.

We’ve learned a lot over the last ten years about how to tune our development practices. Test-Driven Development is no more “over” than Object-Orientation is; it’s just that we understand better how to apply it. I think our early understanding was coloured by the fact that the original eXtreme Programming project, C3, was payroll, an ordered system; I don’t want my pay cheque worked out by trying some numbers and seeing who complains [3]. We learned to Embrace Change, that it’s a sign of a healthy development environment rather than a problem. As we’ve expanded into less predictable domains, we’re also learning to Embrace Failure.


1) this is a pretty good description of many “Best Practice” initiatives
2) Fred George has been documenting safe-fail in the organisation of his development group too; he calls it “Programmer Anarchy”
3) although I’ve seen shops that come close to this

Speaking and giving a tutorial at QCon London, 7-11 March

Nat and I will be running our “TDD at the System Scale” tutorial at QCon London. Sign up soon.

I’ll also be presenting an engaging rant on why we should aspire to living and working in a world where stuff just works.

If you quote the promotion code FREE100 when you sign up, QCon will give you a discount of £100 and donate the same amount to Crisis Charity.

Responding to Brian Marick

Brian’s been paying us the compliment of taking our book seriously and working through our extended example, translating it to Ruby.

He has a point of contention in that he’s doubtful about the value of our end-to-end tests. To be more precise, he’s doubtful about the value of our automated end-to-end tests, a view shared by J.B. Rainsberger, Arlo Belshee, and Jim Shore. That’s a pretty serious group. I think the answer, as always, is “it depends”.

There are real advantages to writing automated end-to-end tests. As Nat pointed out in an extended message to the mailing list for the book,

Most significantly to me, however, is the difference between “testing” end-to-end or through the GUI and “test-driving”. A lot of people who are evangelical about TDD for coding do not use end-to-end tests for driving design at the system scale. I have found that writing tests gives useful design feedback, no matter what the scale.

For example, during Arlo and Jim’s session, I was struck by how many of the “failure stories” described situations where the acceptance tests were actually doing their job: revealing problems (such as deployment difficulties) that needed to be fixed.

Automating an end-to-end test helps me think more carefully about what exactly I care about in the next feature. Automating tests for many features encourages me to work out a language to describe them, which clarifies how I describe the system and makes new features easier to test.

And then there’s scale. Pretty much anything will work for a small system (although Alan Shalloway has a story about how even a quick demonstrator project can get out of hand). For larger systems, things get complicated, people come and go, and the team isn’t quite as confident as it needs to be about where things are connected. Perhaps these are symptoms of weaknesses in the team culture, but it seems wasteful to me to take the design experience we gained while writing the features and not encode it somewhere.

Of course this comes at a price. Effective end-to-end tests take skill, experience, and (most important) commitment. Not every system I’ve seen has been programmed by people who are as rigorous as Nat about making the test code expressive or allowing testing to drive the design. Worse, a large collection of badly written end-to-end tests (a pattern I’ve seen a few times) is a huge drag on development. Is that price worth paying? It (ahem) depends, and part of the skill is in finding the right places to test.

So, let me turn Brian’s final question around. What would it take to make automated end-to-end tests less scary?

Friday 13th, Talking at Skills Matter

Prove you’re not superstitious! I’ll be giving my talk on Sustainable TDD at Skills Matter on Friday, 13th November. Sign up here (if you dare).

This talk is about the qualities we look for in test code that keep the development “habitable.” We want to make sure the tests pull their weight by making them expressive, so that we can tell what’s important when we read them and when they fail, and by making sure they don’t become a maintenance drag themselves. We need to apply as much care and attention to the tests as we do to the production code, although the coding styles may differ. Difficulty in testing might imply that we need to change our test code, but often it’s a hint that our design ideas are wrong and that we ought to change the production code. In practice, these qualities are all related to and support each other. Test-driven development combines testing, specification, and design into one holistic activity.

I just ran it at the BBC and people seemed to like it.

If you miss this opportunity, you can always see it at QCon San Francisco.

QCon San Francisco


I’m running a track at QCon in San Francisco on Friday 20th November. The topic is Technical Skills for Agile Development, and it’s about some of the technical essentials that Agile teams need to keep moving.

I’ll be presenting a session, based on material from our book, on how to live with your tests over the long term.

See you there?