Expertise is learning

This recent tweet resonated with people:

An expert is someone who keeps learning even when learning is harder to find.
Moi

I’m hesitant to claim expertise about much of anything — not least because being an expert doesn’t imply continuing to be an expert — but I’ll readily admit to being pretty good, when faced with a problem, at learning whatever I need to learn to solve it. If I’m expert at anything, it’s that.

Debugging is learning

This less recent tweet of similar form resonated with considerably fewer people:

A bug is a mental-model mismatch. Your mental model of the code is a cache. To debug well, clear your cache.
Moi aussi

A bug is when we thought we knew something and it turns out we don’t. So debugging, by definition, is learning: figuring out what we don’t know, then problem-solving our way to knowing it.

I can’t stand debugging

I like things I’m good at. I’m good at learning. I hate debugging. Let’s debug that.

If we need an extended debugging session, we’ve screwed up. We’ve arranged to have no idea how close we are until we’re suddenly done — if we’re ever done. We’ve arranged to spend an indeterminate, unbounded amount of time and patience to learn something very crucial and small, late in the game, when it’s expensive.

Had we arranged to spend a tiny bit more time and patience earlier, we could have been directing our efforts much more productively (and predictably) now. But here we are, debugging. If I can’t get over my revulsion at having put ourselves in this situation, I compound the cost. So I’ve learned to get over it and on with it.

I’m okay at debugging

I have no particular skill with debugging tools. I do have particular skill at developing software in a way that minimizes the need for debugging. Sometimes my ego-wires get crossed and I’ll take pride directly in my lack of skill with debuggers. Whoops!

Even though I can’t stand debugging, it is learning, and for that reason I am (if I can get out of my own emotional way) somewhat skilled at it.

Debugging even when debugging is harder to do

When my niece is ideating, she says “How about this?” Sometimes more than once, to buy time, and perhaps also to generously prepare me to board her flight of fancy.

How about this? There’s a bug. If we don’t fix it, a whole bunch of related functionality that is very valuable (and urgently needed) is fairly useless. So we need to fix it.

We didn’t write any of this code. We have only a surface familiarity with the problem domain and the system architecture. We are, nonetheless, going to be the ones working on this bug.

The bug is somewhere in a big pile of long-method, interspersed-concern, tightly-coupled, ServiceLocator-using, completely-untouched-by-test code. (But I repeat myself.) Naturally, we didn’t uncover it until pre-production. We can reproduce it fairly reliably there, given sufficient load and concurrency. We haven’t been able to reproduce it in any other environments. We aren’t allowed to attach a remote debugger in pre-production (not that that’d help me, because I’m awesome at not knowing debuggers (whoops again!)).

What can’t we do?

This less unrecent tweet resonated with less fewer folks:

Chasing a bug, how often do you ask “Did that fix it?” Be kind to yourself. Get answers fast. Write a failing test first.
Moi-même encore une fois

Do we know exactly where the bug is? No.

Can we reproduce the bug faster? Not that we know of. It would be ideal if we could first reproduce it locally in any way whatsoever, and then automate that with a failing test. We’d give ourselves many more chances per day to take actions that close in on the problem. But let’s say we know enough about the components in interaction to know it’s prohibitively expensive to try.

What can we do?

We can obtain server logs, and we see exceptions thrown from our code.

The feedback loop

Armed with the logs and the following technique, we can narrow down where the bug must be hiding.

  1. In the logs, skip past the app server crud to the stack trace from our code.
  2. In the code, inspect the method that threw the exception.
  3. In tests, identify all the ways that method can possibly throw that exception. Use any means necessary to get it under test.
  4. For each place the exception could be thrown, add a guard clause that returns a sensible value right after it writes uniquely and identifiably to the server log (see the sketch just after this list).
  5. Deploy; reproduce the bug; collect the logs; verify that the new exception is coming from one level down the call stack; goto 2.
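
Here’s a minimal sketch of step 4, in Java, with every name invented: the class, the null check, and the BUGHUNT marker all stand in for whatever the stack trace actually points at.

```java
import java.util.List;
import java.util.logging.Logger;

// Illustrative only: the class, the null check, and the BUGHUNT marker are
// invented, standing in for whatever the stack trace actually points at.
class InvoiceTotaler {
    private static final Logger LOG = Logger.getLogger(InvoiceTotaler.class.getName());

    double total(List<Double> lineAmounts) {
        // Guard clause added for this debugging pass: rather than let the
        // exception escape, write a uniquely greppable marker to the server
        // log and return a sensible value so the flow can continue.
        if (lineAmounts == null) {
            LOG.warning("BUGHUNT-04a: InvoiceTotaler.total received null lineAmounts");
            return 0.0;
        }
        return lineAmounts.stream().mapToDouble(Double::doubleValue).sum();
    }
}
```

The only thing that matters about the marker is that one grep through the next pre-production run tells us which throw site fired, no remote debugger required.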

This is legacy code. We’ll increase visibility to make clunky private methods testable. We’ll mock anything and everything that should have been a fakeable dependency. The tests will be inelegant, but we’ll have some. The code is already inelegant, untested, and buggy. We’re making it slightly more inelegant, but so what? We’re going to make it less buggy where it really counts, and we’re going to leave behind tests covering code that was evidently both valuable and hard to get right.
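
Here’s roughly what “inelegant, but we’ll have some” might look like, assuming JUnit 5 and Mockito; the locator, the lookup, and the quoter are hypothetical stand-ins for the real legacy parts.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;

// Everything below is a hypothetical stand-in for the real legacy parts: a
// crude ServiceLocator, a collaborator that should have been an injected
// dependency, and a formerly-private method whose visibility was widened
// just enough for the test to reach it.
class LegacyCharacterizationTest {

    interface SurchargeLookup {
        double rateFor(String region);
    }

    static class ServiceLocator {
        static final Map<Class<?>, Object> REGISTRY = new HashMap<>();
        static <T> void register(Class<T> type, T impl) { REGISTRY.put(type, impl); }
        @SuppressWarnings("unchecked")
        static <T> T resolve(Class<T> type) { return (T) REGISTRY.get(type); }
    }

    static class ShippingQuoter {
        // Was private in the legacy class; package-private now so the test can call it.
        double quoteWithSurcharge(double base, String region) {
            SurchargeLookup lookup = ServiceLocator.resolve(SurchargeLookup.class);
            return base * (1 + lookup.rateFor(region));
        }
    }

    @Test
    void quoteAppliesTheLocatedSurcharge() {
        // Mock the collaborator and shove it into the locator the legacy code reads from.
        SurchargeLookup lookup = mock(SurchargeLookup.class);
        when(lookup.rateFor("CA")).thenReturn(0.10);
        ServiceLocator.register(SurchargeLookup.class, lookup);

        assertEquals(110.0, new ShippingQuoter().quoteWithSurcharge(100.0, "CA"), 0.001);
    }
}
```

It’s not pretty: the test knows about the locator and reaches a formerly-private method. But it answers “did that change break anything?” in seconds, locally.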

When we detected the system’s fingers tingling, that was just a symptom. The technique here is to squeeze the capillaries and push toward the heart of the bug. N times through the feedback loop, as we’re putting tests around the Nth method down the stack, we’ll realize we’ve arrived. This must be the place. Then we can write a test that’s red because the bug exists, and turns green when we’ve squashed it. When we redeploy and can no longer reproduce the problem, it’s gone.
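
That final red-to-green test might look something like this; the names and the bug itself are invented, because the point is the shape: it fails against the buggy code and passes once the real fix lands.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical shape of the final test: names and the bug itself are invented.
// The real assertion would capture whatever wrong behavior the Nth method
// down the stack actually has.
class BugIsSquashedTest {

    static class DiscountCalculator {
        // Buggy version: integer division truncates, so 999 cents at 10% off
        // comes back as 900 instead of 899.
        long discountedCents(long cents, int percentOff) {
            return cents - (cents * percentOff / 100); // fix: round instead of truncating
        }
    }

    @Test
    void discountRoundsInsteadOfTruncating() {
        // Red against the buggy code above; green once the rounding fix lands.
        assertEquals(899, new DiscountCalculator().discountedCents(999, 10));
    }
}
```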

Retrospective

Despite all the obstacles, we figured out how to learn what we needed to learn to solve the problem.

This method might seem like brute force. In a way, it is. If we’d been test-driving all along, the effort would have been amortized, and it wouldn’t have occurred to anyone that concerted effort was required to avoid this bug. It probably wouldn’t have occurred to anyone that this bug could even exist; after all, it didn’t occur to us this time.

If we needed brute force, it’s because we made ourselves need it. We are not obligated to continue doing this to ourselves.