This week the team had a “firebreak”, a week of doing whatever you want so long as it’s useful. I spent the time improving healthchecks for GOV.UK’s apps.

The way our healthchecks used to work had a few problems:

  • Our healthchecks always returned a 200 status code and communicated the details (which could be something like status: critical) in the response body. Software that expects the status code to be meaningful, like an AWS load balancer, therefore assumes our apps are healthy even when they’re not.

  • Because our monitoring system is powered by ancient versions of Graphite and Icinga (and configured with an ancient version of Puppet), adding new alerts “properly” is difficult, so lots of apps co-opted their healthcheck to check other things as well, like whether the database has any soon-to-expire API tokens. That’s something which needs monitoring, but it’s not a healthcheck.

  • Every app had a single /healthcheck endpoint, rather than following the current industry good practice of having separate liveness and readiness healthchecks.
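The first and third problems can be sketched together. This is a minimal, hypothetical illustration in Python (not GOV.UK’s actual Ruby code; all names are illustrative): the old style always returns 200 and buries the real state in the body, while the new style splits liveness from readiness and makes the status code itself meaningful, so a load balancer can act on it.

```python
def old_healthcheck(db_ok: bool) -> tuple[int, dict]:
    # Old contract: always HTTP 200, with the real state only in the
    # body. A load balancer reading just the status code would consider
    # the app healthy even when it is not.
    return 200, {"status": "ok" if db_ok else "critical"}


def liveness() -> tuple[int, dict]:
    # Liveness: "is the process up and able to respond?" No dependency
    # checks, so a broken database doesn't get the app restarted.
    return 200, {"status": "ok"}


def readiness(db_ok: bool) -> tuple[int, dict]:
    # Readiness: "can the app usefully serve traffic right now?"
    # A 503 tells the load balancer to stop routing requests here.
    if db_ok:
        return 200, {"status": "ok"}
    return 503, {"status": "critical"}
```

The key design point is that readiness failures are signalled in-band (via the status code) rather than requiring every consumer to parse the body.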

So I worked through our ~50 apps, implementing separate liveness and readiness healthchecks and adding new alerts for the not-healthchecks that had proliferated. I still need to deploy the monitoring configuration change, and then the AWS load balancer configuration change. I hope to get that done on Monday morning; then I can go through all the apps again and delete the old /healthcheck endpoint.
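Moving a not-healthcheck into a proper alert might look something like this hypothetical sketch (function and metric names are mine, not from the real apps): the soon-to-expire-tokens check runs from a scheduled job and emits a metric for the monitoring system to alert on, instead of piggybacking on the request-path healthcheck.

```python
import datetime


def count_expiring_tokens(tokens, within_days=30, now=None):
    # Count tokens whose expiry falls inside the warning window.
    now = now or datetime.datetime.now(datetime.timezone.utc)
    cutoff = now + datetime.timedelta(days=within_days)
    return sum(1 for t in tokens if t["expires_at"] <= cutoff)


def emit_metric(name, value):
    # Stand-in for pushing to Graphite/statsd; printed here for clarity.
    print(f"{name} {value}")


# Run from a scheduled job, not from the healthcheck endpoint:
def check_api_tokens(tokens):
    emit_metric("app.expiring_api_tokens", count_expiring_tokens(tokens))
```

The monitoring system then alerts when the metric is non-zero, and the healthcheck stays a pure statement about whether the app can serve traffic.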

While it wasn’t a very exciting piece of work (opening 50 almost identical PRs doesn’t stretch the creative muscles much), it was satisfying to work through, and to put GOV.UK in a slightly better place. Now, if only our monitoring system wasn’t so old…


This week I read:

Software Engineering