barrucadu's memos - General

How DNS works

2022-04-03T00:00:00Z

The Domain Name System is a huge distributed eventually-consistent database¹ mapping names, like memo.barrucadu.co.uk, to numbers, like 116.203.34.201. It’s federated, with trusted entities (you may have heard of the “DNS root servers”) delegating control of segments of the DNS namespace to others. It holds hundreds of millions of records, and updates to this database are typically visible in minutes to hours.

And the protocol behind it is not massively different from the one standardised in the 1980s.

In this memo I’ll cover:

The DNS protocol
How your browser gets from memo.barrucadu.co.uk to an IP address
What a “zone” is
The difference between authoritative, recursive, and forwarding nameservers
What happens when you update a DNS record (there’s no such thing as “propagation”)
Finally, whether these old standards I’m talking about are still enough, today

If you want to get it straight from the horse’s mouth, RFC 1034: Domain Names - Concepts and Facilities and RFC 1035: Domain Names - Implementation and Specification are the standards I’m drawing on. They’re very approachable, and I encourage you to read them.

You can also look at resolved, the DNS server I wrote, which acts as both a recursive (or forwarding) and authoritative nameserver, and is suitable for home networks. Well, my home network. I can’t promise anything about yours.

The DNS protocol

Let’s start with an example:²

$ dig memo.barrucadu.co.uk +noedns
; <<>> DiG 9.16.25 <<>> memo.barrucadu.co.uk +noedns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37169
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;memo.barrucadu.co.uk.          IN      A

;; ANSWER SECTION:
memo.barrucadu.co.uk.   292     IN      CNAME   barrucadu.co.uk.
barrucadu.co.uk.        292     IN      A       116.203.34.201

;; AUTHORITY SECTION:
barrucadu.co.uk.        2975    IN      NS      ns-98.awsdns-12.com.
barrucadu.co.uk.        2975    IN      NS      ns-1520.awsdns-62.org.
barrucadu.co.uk.        2975    IN      NS      ns-1828.awsdns-36.co.uk.
barrucadu.co.uk.        2975    IN      NS      ns-763.awsdns-31.net.

;; Query time: 0 msec
;; SERVER: 185.12.64.2#53(185.12.64.2)
;; WHEN: Tue Mar 22 16:42:02 GMT 2022
;; MSG SIZE  rcvd: 202

I’ve used dig a lot so I’m fairly used to reading this output, but I’ve since realised I wasn’t really reading it.

What does flags: qr rd ra mean?

The QUESTION SECTION and ANSWER SECTION make sense, but what’s the point of the AUTHORITY SECTION? Do all queries have an AUTHORITY SECTION?

$ dig www.google.com +noedns
; <<>> DiG 9.16.25 <<>> www.google.com +noedns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46676
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         102     IN      A       142.250.185.100

;; Query time: 0 msec
;; SERVER: 185.12.64.2#53(185.12.64.2)
;; WHEN: Tue Mar 22 16:49:36 GMT 2022
;; MSG SIZE  rcvd: 48

…no AUTHORITY SECTION there. Is it unimportant? Or optional?

Also, all the domain names there have a trailing dot. What’s that about?³

Time to dig into the protocol. RFC 1035 is our guide here.

Format of a DNS Message

DNS has two types of messages, queries and responses, and uses port 53. It prefers UDP but, if a message is too long to send in a single UDP datagram, it falls back to TCP.

A DNS message has five parts. These are:

A header, which specifies what sort of message this is and how many entries are in the other parts. This also has those flags we saw in the dig output.
The “question section”, which specifies what sort of records the client is interested in. Did you know that you can ask multiple questions with a single DNS query? I didn’t.
The “answer section”, a collection of records directly answering the questions.
The “authority section”, a series of NS records pointing to an authoritative source which can answer the questions.
The “additional section”, a series of records which may be useful when using records from the answer and authority sections. For example, the A records for any nameservers given in the authority section.

The answer, authority, and additional sections won’t be present in a query. But the question section will be present in a response: it’s copied over from the query.

The Header

The header is 12 bytes long and has a few different fields packed in there. RFC 1035 has some nice ASCII art illustrations:

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Where,

ID is a 16-bit random identifier set by the client and copied into the response by the server. Since UDP is connectionless, this is essential for the client to know which response goes with which query.⁴
QR indicates whether this is a query (0) or a response (1).
OPCODE is a four-bit field, set by the client and copied into the response by the server, indicating what type of query this message is. The most common opcode is 0, which is a “standard query”.
AA (“Authoritative Answer”) is set by the server and means that this response is authoritative.

More on authority in the “Zones?” section.
TC (“Truncation”) is set by the server and means that the full response couldn’t fit in a single UDP datagram, and so the client should try again using TCP.⁵
RD (“Recursion Desired”) is set by the client, and copied into the response by the server, and means that the client would like the server to answer the question recursively, if it can.

More on recursive and non-recursive resolution in the “How resolution happens” section.
RA (“Recursion Available”) is set by the server and means that it can perform recursive resolution, if requested.
Z is reserved for future use, and so should be set to zero if you don’t implement those future standards.
RCODE is a four-bit field, set by the server, indicating what type of response this message is. There are a few common ones (with the names dig prints in its status field):
- 0 means no error (NOERROR)
- 1 means the server couldn’t understand the query (FORMERR)
- 2 means the server encountered an error processing the query (SERVFAIL)
- 3 means the domain name in the query doesn’t exist (NXDOMAIN)
- 4 means the server doesn’t support this sort of query (NOTIMP)
- 5 means the server refused to answer the query (REFUSED)
QDCOUNT, ANCOUNT, NSCOUNT, and ARCOUNT are unsigned 16-bit (big endian) integers specifying the number of entries in the question, answer, authority, and additional sections (respectively) of the message.

Since all the multi-byte fields in a DNS message are unsigned and big endian, I’ll not mention it from now on.
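To make the bit layout concrete, here is a minimal Python sketch of unpacking and repacking the 12-byte header. The `Header` type and the function names are made up for illustration; only the field layout comes from RFC 1035.

```python
import struct
from typing import NamedTuple


class Header(NamedTuple):
    """The twelve header fields, in wire order (illustrative type)."""
    id: int
    qr: int
    opcode: int
    aa: int
    tc: int
    rd: int
    ra: int
    z: int
    rcode: int
    qdcount: int
    ancount: int
    nscount: int
    arcount: int


def parse_header(data: bytes) -> Header:
    """Decode the first 12 bytes of a DNS message."""
    # Six unsigned 16-bit big-endian integers: ID, flags, and four counts.
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", data[:12])
    return Header(
        id=ident,
        qr=(flags >> 15) & 0x1,
        opcode=(flags >> 11) & 0xF,
        aa=(flags >> 10) & 0x1,
        tc=(flags >> 9) & 0x1,
        rd=(flags >> 8) & 0x1,
        ra=(flags >> 7) & 0x1,
        z=(flags >> 4) & 0x7,
        rcode=flags & 0xF,
        qdcount=qd,
        ancount=an,
        nscount=ns,
        arcount=ar,
    )


def pack_header(h: Header) -> bytes:
    """Encode a Header back into 12 bytes."""
    flags = (
        h.qr << 15 | h.opcode << 11 | h.aa << 10 | h.tc << 9
        | h.rd << 8 | h.ra << 7 | h.z << 4 | h.rcode
    )
    return struct.pack(
        "!HHHHHH", h.id, flags, h.qdcount, h.ancount, h.nscount, h.arcount
    )
```

Feeding it a header whose flag bytes are 0x81 0x80 gives qr, rd, and ra set and everything else clear, which is the same “flags: qr rd ra” that dig printed.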

Domain Names

Before diving into the other sections, let’s have a look at how domain names are encoded. They show up a lot, after all.

Let’s take the domain name memo.barrucadu.co.uk., and separate it by dots. This gives us a sequence of labels:

memo
barrucadu
co
uk
(the empty label)

How you actually interpret those labels is a bit confused, unfortunately.

RFC 1035 says that they are sequences of arbitrary octets and that you can’t assume any particular character encoding… but it also says that labels are to be compared case-insensitively.

RFC 4343 clarifies that that means octets in the range 0x41 to 0x5a (the upper case ASCII letters) are considered equal to corresponding octets in the range 0x61 to 0x7a (the lower case ASCII letters), and vice versa, but that that still doesn’t mean that labels are ASCII, as they can also contain arbitrary non-ASCII octets.
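As a small sketch of what that comparison rule means in practice (the function names here are invented for illustration), only the 26 ASCII letters are folded and every other octet has to match exactly:

```python
def fold_case(label: bytes) -> bytes:
    """Map octets 0x41-0x5a (A-Z) to 0x61-0x7a (a-z); leave all others alone."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels the way RFC 4343 describes."""
    return fold_case(a) == fold_case(b)
```

So `labels_equal(b"MeMo", b"memo")` is true, but the octets 0xc9 and 0xe9 (É and é in Latin-1) are not considered equal, because DNS doesn’t know or care what encoding they are in.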

But there’s also RFC 3492, which defines the punycode standard for encoding internationalised, i.e. unicode, domain names into ASCII. So maybe domain names are ASCII after all?

There may well be a later RFC which resolves this ambiguity and says that labels are definitely ASCII, but I haven’t seen it yet.

Anyway, back to the topic of encoding.

A label is encoded as a one-octet length field followed by the octets of the label. And an encoded domain name is a sequence of encoded labels. This means that a domain name ends with 0x00, the length of the empty label.⁶

So memo.barrucadu.co.uk. is encoded as:

0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00

There are two restrictions on the validity of domain names:

A single label may be no more than 63 octets long (not including the length octet)
An entire encoded domain name may be no more than 255 octets long (including the label length octets)
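Here is a rough Python sketch of that encoding, including both length checks. It assumes ASCII-only labels and treats a name with or without the trailing dot the same way, which is a simplification; `encode_name` is a name made up for this example.

```python
def encode_name(name: str) -> bytes:
    """Encode a dotted domain name in DNS wire format, without compression."""
    stripped = name[:-1] if name.endswith(".") else name
    labels = stripped.split(".") if stripped else []
    out = bytearray()
    for label in labels:
        octets = label.encode("ascii")  # assumption: ASCII-only labels
        if not 1 <= len(octets) <= 63:
            raise ValueError(f"label must be 1 to 63 octets long: {label!r}")
        out.append(len(octets))  # one-octet length field
        out += octets
    out.append(0)  # the empty label that ends every name
    if len(out) > 255:
        raise ValueError("encoded name is longer than 255 octets")
    return bytes(out)
```

Calling `encode_name("memo.barrucadu.co.uk.")` produces the 22 octets shown above, and `encode_name(".")` is just the single octet 0x00.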

Compression

Unfortunately, that’s not all.

Domain names get repeated a lot in DNS messages, and the 512-byte limit on a DNS message sent over UDP can start to feel pretty restrictive. So DNS also has a compression mechanism, where some suffix of a domain name can be replaced with a pointer to an earlier occurrence of that name.

So if the name memo.barrucadu.co.uk. appears in a message twice, the second occurrence could be represented as:

memo.barrucadu.co.uk.
memo.barrucadu.co.[pointer to uk.]
memo.barrucadu.[pointer to co.uk.]
memo.[pointer to barrucadu.co.uk.]
[pointer to memo.barrucadu.co.uk.]

But how do you distinguish between a regular label and a pointer? Well, remember that a label can’t be longer than 63 octets. And what’s 63 as an 8-bit binary number?

It’s 00111111.

There’s two whole bits there at the front which are completely wasted!

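A length octet always starts with 00, so an octet starting with 11 can’t be a length. So pointers are encoded as a two-octet sequence: the bits 11 followed by a 14-bit offset from the start of the message.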

Pretty clever.
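Decoding is where compression gets fiddly, so here is a minimal Python sketch of reading a possibly-compressed name out of a message. The function name, the ASCII assumption, and the loop guard are choices made for this example rather than anything the RFC spells out in this form.

```python
def decode_name(message: bytes, offset: int) -> tuple[str, int]:
    """Decode the name starting at `offset`, following compression pointers.

    Returns the dotted name and the offset just past the name where it
    originally appeared (following a pointer does not move that offset).
    """
    labels = []
    end = None    # where parsing should continue after this name
    seen = set()  # pointer targets already followed, to catch loops
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:
            # Pointer: the low 6 bits plus the next octet form a 14-bit offset.
            target = ((length & 0x3F) << 8) | message[offset + 1]
            if end is None:
                end = offset + 2
            if target in seen:
                raise ValueError("compression pointer loop")
            seen.add(target)
            offset = target
        elif length & 0xC0:
            raise ValueError("reserved label type")
        elif length == 0:
            # The empty label: end of the name.
            if end is None:
                end = offset + 1
            break
        else:
            offset += 1
            # Assumption: labels are ASCII, which real traffic need not obey.
            labels.append(message[offset:offset + length].decode("ascii"))
            offset += length
    return ".".join(labels) + ".", end
```

The `seen` set matters: a malicious message can contain a pointer that points at itself, and a decoder without some guard will spin forever.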

Questions

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                     QNAME                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     QTYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     QCLASS                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Where,

QNAME is the domain name, which can be any length (so long as it’s properly encoded). It’s not padded to any specific size.
QTYPE is a 16-bit integer specifying the type of records the client is interested in. This will usually be a record type (see the next subsection) or 255, meaning “all records”. There are a few other QTYPEs but those are less common.
QCLASS is a 16-bit integer specifying which network class the client is interested in. These days this will always be 1, or IN, for “internet”.⁷

We can now understand the question section of our dig example!

;; QUESTION SECTION:
;memo.barrucadu.co.uk.          IN      A

Means that it’s looking for an internet address record for memo.barrucadu.co.uk. (yes, it shows the type and class the other way around). That question is encoded as:

0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00  ; qname:  memo.barrucadu.co.uk.
0x00 0x01                                                   ; qtype:  A
0x00 0x01                                                   ; qclass: IN
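The same question can be produced by a few lines of Python. This is only a sketch: `encode_question` and the constants are names invented for the example, and the name encoder skips the length checks for brevity.

```python
import struct

TYPE_A = 1
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a dotted ASCII name as length-prefixed labels (no validation)."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        out.append(len(label))
        out += label.encode("ascii")
    out.append(0)  # the empty label
    return bytes(out)


def encode_question(qname: str, qtype: int = TYPE_A, qclass: int = CLASS_IN) -> bytes:
    """QNAME followed by QTYPE and QCLASS as 16-bit big-endian integers."""
    return encode_name(qname) + struct.pack("!HH", qtype, qclass)


question = encode_question("memo.barrucadu.co.uk.")
```

Here `question` is 26 octets long: 22 for the name and 2 each for the type and class.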

Resource Records

The answer, authority, and additional sections are all a sequence of resource records:

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                                               /
    /                      NAME                     /
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     CLASS                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TTL                      |
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                   RDLENGTH                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
    /                     RDATA                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Where,

NAME is the domain name, which is variable-length like the QNAME of a question.
TYPE is a 16-bit integer specifying what sort of record this is. There are a fair few of these, but some common ones are:
- 1, an A record
- 2, an NS record
- 5, a CNAME record
- 28, a AAAA record (from RFC 3596)
- and plenty of others
CLASS is a 16-bit integer specifying what network class this record applies to. Like the QCLASS, these days this will always be 1. Unless you’re specifically running some sort of old non-IP-based network for fun.
TTL is a 32-bit integer specifying the number of seconds that this record is valid for. This is important for caching purposes. Zero has a special meaning: it means that you can use the record to do whatever it is you’re doing right now, but that you can’t cache it at all.
RDLENGTH is a 16-bit integer specifying the length of the RDATA section.
RDATA is the record data, which is type- and class-specific. For example:
- IN A records hold an IPv4 address, as a 32-bit number
- IN NS and IN CNAME records hold a domain name
- IN AAAA records hold an IPv6 address, as a 128-bit number

Returning to our dig example, we had a few different resource records in the response. Let’s just look at the answer section:

;; ANSWER SECTION:
memo.barrucadu.co.uk.   292     IN      CNAME   barrucadu.co.uk.
barrucadu.co.uk.        292     IN      A       116.203.34.201

We have one IN CNAME record for memo.barrucadu.co.uk. and one IN A record for barrucadu.co.uk.. This is because, upon encountering a CNAME, resolution starts again with whatever name the CNAME refers to.⁸

Leaving out the name compression for simplicity, those records are encoded as:

0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00  ; name:     memo.barrucadu.co.uk.
0x00 0x05                                                   ; type:     CNAME
0x00 0x01                                                   ; class:    IN
0x00 0x00 0x01 0x24                                         ; ttl:      292
0x00 0x11                                                   ; rdlength: 17
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00               ; rdata:    barrucadu.co.uk.

0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00               ; name:     barrucadu.co.uk.
0x00 0x01                                                   ; type:     A
0x00 0x01                                                   ; class:    IN
0x00 0x00 0x01 0x24                                         ; ttl:      292
0x00 0x04                                                   ; rdlength: 4
0x74 0xcb 0x22 0xc9                                         ; rdata:    116.203.34.201
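As a sketch of how those bytes could be generated (the helper names are invented for this example, and name compression is left out just as it is above):

```python
import ipaddress
import struct

TYPE_A = 1
TYPE_CNAME = 5
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a dotted ASCII name as length-prefixed labels (no validation)."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        out.append(len(label))
        out += label.encode("ascii")
    out.append(0)
    return bytes(out)


def encode_record(name: str, rtype: int, ttl: int, rdata: bytes) -> bytes:
    """NAME, then TYPE, CLASS, TTL, RDLENGTH (2, 2, 4, 2 octets), then RDATA."""
    fixed = struct.pack("!HHIH", rtype, CLASS_IN, ttl, len(rdata))
    return encode_name(name) + fixed + rdata


cname_record = encode_record(
    "memo.barrucadu.co.uk.", TYPE_CNAME, 292, encode_name("barrucadu.co.uk.")
)
a_record = encode_record(
    "barrucadu.co.uk.", TYPE_A, 292, ipaddress.IPv4Address("116.203.34.201").packed
)
```

RDLENGTH falls out of `len(rdata)`, which is one less thing to get wrong by hand: 17 for the CNAME and 4 for the A record.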

Example DNS query & response

Returning to our dig memo.barrucadu.co.uk +noedns example from the beginning, we can now see the whole encoded query and response. I’ve included comments and linebreaks to make it clear what’s what.

Here’s the query:

;; header
0x91 0x31 ; ID: 37169
0x01 0x00 ; flags: RD
0x00 0x01 ; QDCOUNT: 1
0x00 0x00 ; ANCOUNT: 0
0x00 0x00 ; NSCOUNT: 0
0x00 0x00 ; ARCOUNT: 0

;; question section
; memo.barrucadu.co.uk. A IN
0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x01 0x00 0x01

And here’s the response (omitting compression):

;; header
0x91 0x31 ; ID: 37169
0x81 0x80 ; flags: QR, RD, RA
0x00 0x01 ; QDCOUNT: 1
0x00 0x02 ; ANCOUNT: 2
0x00 0x04 ; NSCOUNT: 4
0x00 0x00 ; ARCOUNT: 0

;; question section
; memo.barrucadu.co.uk. A IN
0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x01 0x00 0x01

;; answer section
; memo.barrucadu.co.uk. CNAME IN 292 barrucadu.co.uk.
0x04 m e m o 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x05 0x00 0x01 0x00 0x00 0x01 0x24 0x00 0x11 0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00
; barrucadu.co.uk. A IN 292 116.203.34.201
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x01 0x00 0x01 0x00 0x00 0x01 0x24 0x00 0x04 0x74 0xcb 0x22 0xc9

;; authority section
; barrucadu.co.uk. NS IN 2975 ns-98.awsdns-12.com.
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x02 0x00 0x01 0x00 0x00 0x0b 0x9f 0x00 0x15 0x05 n s - 9 8 0x09 a w s d n s - 1 2 0x03 c o m 0x00
; barrucadu.co.uk. NS IN 2975 ns-1520.awsdns-62.org.
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x02 0x00 0x01 0x00 0x00 0x0b 0x9f 0x00 0x17 0x07 n s - 1 5 2 0 0x09 a w s d n s - 6 2 0x03 o r g 0x00
; barrucadu.co.uk. NS IN 2975 ns-1828.awsdns-36.co.uk.
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x02 0x00 0x01 0x00 0x00 0x0b 0x9f 0x00 0x19 0x07 n s - 1 8 2 8 0x09 a w s d n s - 3 6 0x02 c o 0x02 u k 0x00
; barrucadu.co.uk. NS IN 2975 ns-763.awsdns-31.net.
0x09 b a r r u c a d u 0x02 c o 0x02 u k 0x00 0x00 0x02 0x00 0x01 0x00 0x00 0x0b 0x9f 0x00 0x16 0x06 n s - 7 6 3 0x09 a w s d n s - 3 1 0x03 n e t 0x00

And that’s that!

The DNS protocol isn’t very complicated. But it is somewhat fiddly, what with each record type having its own RDATA format, and the domain name compression. One big thing I learned implementing resolved is to always fuzz test your serialisation and deserialisation logic.
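To give a flavour of what that fuzz testing can look like, here is a small self-contained Python sketch. It is not the test suite from resolved; the functions are hypothetical, it only covers uncompressed names, and it uses a seeded random generator rather than a real fuzzing tool.

```python
import random


def encode_labels(labels: list[bytes]) -> bytes:
    """Serialise a list of labels as an uncompressed domain name."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("bad label length")
        out.append(len(label))
        out += label
    out.append(0)
    if len(out) > 255:
        raise ValueError("name too long")
    return bytes(out)


def decode_labels(data: bytes) -> list[bytes]:
    """Parse an uncompressed name; reject anything malformed with ValueError."""
    labels, offset = [], 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated name")
        length = data[offset]
        offset += 1
        if length == 0:
            break
        if length > 63:
            raise ValueError("not a plain label")
        label = data[offset:offset + length]
        if len(label) != length:
            raise ValueError("truncated label")
        labels.append(label)
        offset += length
    if offset != len(data):
        raise ValueError("trailing bytes")
    return labels


def random_name(rng: random.Random) -> list[bytes]:
    """Generate a random valid name made of arbitrary octets."""
    labels, total = [], 1  # 1 for the final empty label
    for _ in range(rng.randint(0, 8)):
        n = rng.randint(1, 63)
        if total + 1 + n > 255:
            break
        labels.append(bytes(rng.randrange(256) for _ in range(n)))
        total += 1 + n
    return labels


def fuzz_round_trip(iterations: int = 1000, seed: int = 0) -> int:
    """Check decode(encode(x)) == x, and that junk input fails cleanly."""
    rng = random.Random(seed)
    for _ in range(iterations):
        labels = random_name(rng)
        assert decode_labels(encode_labels(labels)) == labels
        junk = bytes(rng.randrange(256) for _ in range(rng.randint(0, 40)))
        try:
            decode_labels(junk)
        except ValueError:
            pass  # rejecting junk is fine; any other exception is a bug
    return iterations
```

The two properties checked are the usual ones: everything you can serialise deserialises to the same value, and arbitrary input never crashes the parser in an unexpected way.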

How resolution happens

When we ran dig memo.barrucadu.co.uk +noedns in the previous section, we got an answer. We found the IP address which memo.barrucadu.co.uk. refers to.

But how?

Well, dig tells us that it talked to some server at 185.12.64.2. But how did that server know? Does it have a copy of the entire DNS? Unlikely, since there are hundreds of millions of records in use.

The answer is that the server followed a process called recursive resolution. This is described in section 5.3.3 of RFC 1034:

1. See if we already know the answer (e.g. the relevant records are already cached), and return it to the client if so
2. Figure out the best nameservers to ask
3. Send them queries until one responds
4. Analyse the response:
   a. If the response answers the question, cache it and return it to the client
   b. If the response gives some better nameservers to use, cache them and go back to step 2
   c. If the response gives a CNAME, and this is not the answer, cache the CNAME record and start again with the new name
   d. If the response is an error or doesn’t make sense, go back to step 3

On the face of it this looks pretty straightforward… but on closer inspection that step 2 is doing a lot of work: how exactly do we “figure out the best nameservers to ask”?⁹

Well, step 4.b gives us a clue here: if the response gives some better nameservers to use, cache them and go back to step 2. So we don’t need to pick the correct nameservers at the very beginning. We only need to know about a nameserver which will be able to point us to a nameserver which knows that (or is closer to knowing that).
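The shape of that loop is easier to see in code. What follows is a toy Python sketch against an in-memory pretend network, not a real resolver: the server names stand in for IP addresses, every referral is assumed to come with usable addresses, TTLs are ignored, and referrals themselves are not cached. All the names in it are invented for the example.

```python
# Pretend network: server -> the records it can answer and the zones it delegates.
SERVERS = {
    "root": {"records": {}, "delegations": {"uk.": ["uk-ns"]}},
    "uk-ns": {"records": {}, "delegations": {"barrucadu.co.uk.": ["barrucadu-ns"]}},
    "barrucadu-ns": {
        "records": {
            "memo.barrucadu.co.uk.": ("CNAME", "barrucadu.co.uk."),
            "barrucadu.co.uk.": ("A", "116.203.34.201"),
        },
        "delegations": {},
    },
}
QUERY_LOG = []  # every (server, name) query sent, for inspection


def query(server, name):
    """Pretend to send a non-recursive query and classify the response."""
    QUERY_LOG.append((server, name))
    known = SERVERS[server]
    if name in known["records"]:
        return known["records"][name]  # ("A", address) or ("CNAME", target)
    matches = [z for z in known["delegations"] if name == z or name.endswith("." + z)]
    if not matches:
        return ("ERROR", None)
    # Refer the client to the most specific delegation this server knows.
    return ("REFERRAL", known["delegations"][max(matches, key=len)])


def resolve(name, cache, roots=("root",), max_steps=20):
    """Follow referrals and CNAMEs from the roots until an address turns up."""
    servers = list(roots)
    for _ in range(max_steps):
        # Step 1: use the cache if we can.
        if name in cache:
            kind, value = cache[name]
            if kind == "A":
                return value
            name, servers = value, list(roots)  # cached CNAME: start again
            continue
        # Steps 2 and 3: ask the best servers we know of until one helps.
        kind, value = "ERROR", None
        for server in servers:
            kind, value = query(server, name)
            if kind != "ERROR":
                break
        # Step 4: analyse the response.
        if kind == "ERROR":
            raise LookupError(f"no server could help with {name}")
        if kind == "REFERRAL":
            servers = value  # better nameservers: go round again
        elif kind == "CNAME":
            cache[name] = (kind, value)
            name, servers = value, list(roots)
        else:
            cache[name] = (kind, value)
            return value
    raise LookupError("gave up: too many steps")
```

Calling `resolve("memo.barrucadu.co.uk.", {})` walks root, then uk-ns, then barrucadu-ns, restarts for the CNAME target, and returns 116.203.34.201. Calling it again with the same cache dictionary sends no queries at all.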

There are thirteen nameservers which, transitively, know about every domain name. These are the root nameservers, and they’re where recursive resolution starts.

You can find them at a.root-servers.net. through m.root-servers.net.

So you just point your recursive resolver at, say, j.root-servers.net. and… oh wait, we have a chicken-and-egg problem. Ultimately, you need to know their IP addresses. IANA, the Internet Assigned Numbers Authority, provides the “root hints” file, which has the IPv4 and IPv6 addresses of these root nameservers.

How do you download that file if you don’t have DNS working to resolve www.iana.org.? Look, you just need IP addresses to get DNS and DNS to get IP addresses. Use 1.1.1.1 or something while you get your fancy recursive resolver working.

Alright, let’s resolve memo.barrucadu.co.uk. recursively! Starting with:

$ dig memo.barrucadu.co.uk @j.root-servers.net
; <<>> DiG 9.16.27 <<>> memo.barrucadu.co.uk @j.root-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48477
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 8, ADDITIONAL: 17
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;memo.barrucadu.co.uk.          IN      A

;; AUTHORITY SECTION:
uk.                     172800  IN      NS      dns1.nic.uk.
uk.                     172800  IN      NS      dns4.nic.uk.
uk.                     172800  IN      NS      nsa.nic.uk.
uk.                     172800  IN      NS      nsd.nic.uk.
uk.                     172800  IN      NS      nsc.nic.uk.
uk.                     172800  IN      NS      nsb.nic.uk.
uk.                     172800  IN      NS      dns3.nic.uk.
uk.                     172800  IN      NS      dns2.nic.uk.

;; ADDITIONAL SECTION:
dns1.nic.uk.            172800  IN      A       213.248.216.1
dns1.nic.uk.            172800  IN      AAAA    2a01:618:400::1
dns4.nic.uk.            172800  IN      A       43.230.48.1
dns4.nic.uk.            172800  IN      AAAA    2401:fd80:404::1
nsa.nic.uk.             172800  IN      A       156.154.100.3
nsa.nic.uk.             172800  IN      AAAA    2001:502:ad09::3
nsd.nic.uk.             172800  IN      A       156.154.103.3
nsd.nic.uk.             172800  IN      AAAA    2610:a1:1010::3
nsc.nic.uk.             172800  IN      A       156.154.102.3
nsc.nic.uk.             172800  IN      AAAA    2610:a1:1009::3
nsb.nic.uk.             172800  IN      A       156.154.101.3
nsb.nic.uk.             172800  IN      AAAA    2001:502:2eda::3
dns3.nic.uk.            172800  IN      A       213.248.220.1
dns3.nic.uk.            172800  IN      AAAA    2a01:618:404::1
dns2.nic.uk.            172800  IN      A       103.49.80.1
dns2.nic.uk.            172800  IN      AAAA    2401:fd80:400::1

;; Query time: 4 msec
;; SERVER: 2001:503:c27::2:30#53(2001:503:c27::2:30)
;; WHEN: Sat Apr 02 23:20:04 BST 2022
;; MSG SIZE  rcvd: 553

Alright, we now know the names and IP addresses of the uk. nameservers. Thanks, additional section!

On we go:

$ dig memo.barrucadu.co.uk @213.248.216.1
; <<>> DiG 9.16.27 <<>> memo.barrucadu.co.uk @213.248.216.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43056
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;memo.barrucadu.co.uk.          IN      A

;; AUTHORITY SECTION:
barrucadu.co.uk.        172800  IN      NS      ns-98.awsdns-12.com.
barrucadu.co.uk.        172800  IN      NS      ns-763.awsdns-31.net.
barrucadu.co.uk.        172800  IN      NS      ns-1520.awsdns-62.org.
barrucadu.co.uk.        172800  IN      NS      ns-1828.awsdns-36.co.uk.

;; Query time: 14 msec
;; SERVER: 213.248.216.1#53(213.248.216.1)
;; WHEN: Sat Apr 02 23:21:28 BST 2022
;; MSG SIZE  rcvd: 183

No additional section here, so we’ll need to resolve one of those nameservers. Back to the root!

$ dig ns-98.awsdns-12.com @j.root-servers.net
; <<>> DiG 9.16.27 <<>> ns-98.awsdns-12.com @j.root-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8418
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 27
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;ns-98.awsdns-12.com.           IN      A

;; AUTHORITY SECTION:
com.                    172800  IN      NS      a.gtld-servers.net.
com.                    172800  IN      NS      b.gtld-servers.net.
com.                    172800  IN      NS      c.gtld-servers.net.
com.                    172800  IN      NS      d.gtld-servers.net.
com.                    172800  IN      NS      e.gtld-servers.net.
com.                    172800  IN      NS      f.gtld-servers.net.
com.                    172800  IN      NS      g.gtld-servers.net.
com.                    172800  IN      NS      h.gtld-servers.net.
com.                    172800  IN      NS      i.gtld-servers.net.
com.                    172800  IN      NS      j.gtld-servers.net.
com.                    172800  IN      NS      k.gtld-servers.net.
com.                    172800  IN      NS      l.gtld-servers.net.
com.                    172800  IN      NS      m.gtld-servers.net.

;; ADDITIONAL SECTION:
a.gtld-servers.net.     172800  IN      A       192.5.6.30
b.gtld-servers.net.     172800  IN      A       192.33.14.30
c.gtld-servers.net.     172800  IN      A       192.26.92.30
d.gtld-servers.net.     172800  IN      A       192.31.80.30
e.gtld-servers.net.     172800  IN      A       192.12.94.30
f.gtld-servers.net.     172800  IN      A       192.35.51.30
g.gtld-servers.net.     172800  IN      A       192.42.93.30
h.gtld-servers.net.     172800  IN      A       192.54.112.30
i.gtld-servers.net.     172800  IN      A       192.43.172.30
j.gtld-servers.net.     172800  IN      A       192.48.79.30
k.gtld-servers.net.     172800  IN      A       192.52.178.30
l.gtld-servers.net.     172800  IN      A       192.41.162.30
m.gtld-servers.net.     172800  IN      A       192.55.83.30
a.gtld-servers.net.     172800  IN      AAAA    2001:503:a83e::2:30
b.gtld-servers.net.     172800  IN      AAAA    2001:503:231d::2:30
c.gtld-servers.net.     172800  IN      AAAA    2001:503:83eb::30
d.gtld-servers.net.     172800  IN      AAAA    2001:500:856e::30
e.gtld-servers.net.     172800  IN      AAAA    2001:502:1ca1::30
f.gtld-servers.net.     172800  IN      AAAA    2001:503:d414::30
g.gtld-servers.net.     172800  IN      AAAA    2001:503:eea3::30
h.gtld-servers.net.     172800  IN      AAAA    2001:502:8cc::30
i.gtld-servers.net.     172800  IN      AAAA    2001:503:39c1::30
j.gtld-servers.net.     172800  IN      AAAA    2001:502:7094::30
k.gtld-servers.net.     172800  IN      AAAA    2001:503:d2d::30
l.gtld-servers.net.     172800  IN      AAAA    2001:500:d937::30
m.gtld-servers.net.     172800  IN      AAAA    2001:501:b1f9::30

;; Query time: 3 msec
;; SERVER: 2001:503:c27::2:30#53(2001:503:c27::2:30)
;; WHEN: Sat Apr 02 23:22:36 BST 2022
;; MSG SIZE  rcvd: 844

We’ve got the com. nameservers. Next!¹⁰

$ dig ns-98.awsdns-12.com @192.5.6.30
; <<>> DiG 9.16.27 <<>> ns-98.awsdns-12.com @192.5.6.30
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59687
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 9
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ns-98.awsdns-12.com.           IN      A

;; AUTHORITY SECTION:
awsdns-12.com.          172800  IN      NS      g-ns-13.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-588.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-1164.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-1740.awsdns-12.com.

;; ADDITIONAL SECTION:
g-ns-13.awsdns-12.com.  172800  IN      A       205.251.192.13
g-ns-13.awsdns-12.com.  172800  IN      AAAA    2600:9000:5300:d00::1
g-ns-588.awsdns-12.com. 172800  IN      A       205.251.194.76
g-ns-588.awsdns-12.com. 172800  IN      AAAA    2600:9000:5302:4c00::1
g-ns-1164.awsdns-12.com. 172800 IN      A       205.251.196.140
g-ns-1164.awsdns-12.com. 172800 IN      AAAA    2600:9000:5304:8c00::1
g-ns-1740.awsdns-12.com. 172800 IN      A       205.251.198.204
g-ns-1740.awsdns-12.com. 172800 IN      AAAA    2600:9000:5306:cc00::1

;; Query time: 23 msec
;; SERVER: 192.5.6.30#53(192.5.6.30)
;; WHEN: Sat Apr 02 23:24:01 BST 2022
;; MSG SIZE  rcvd: 317

Nearly there… each query gets us a step or two closer.

$ dig ns-98.awsdns-12.com @205.251.192.13
; <<>> DiG 9.16.27 <<>> ns-98.awsdns-12.com @205.251.192.13
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43579
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ns-98.awsdns-12.com.           IN      A

;; ANSWER SECTION:
ns-98.awsdns-12.com.    172800  IN      A       205.251.192.98

;; AUTHORITY SECTION:
awsdns-12.com.          172800  IN      NS      g-ns-1164.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-13.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-1740.awsdns-12.com.
awsdns-12.com.          172800  IN      NS      g-ns-588.awsdns-12.com.

;; ADDITIONAL SECTION:
g-ns-1164.awsdns-12.com. 172800 IN      A       205.251.196.140
g-ns-1164.awsdns-12.com. 172800 IN      AAAA    2600:9000:5304:8c00::1
g-ns-13.awsdns-12.com.  172800  IN      A       205.251.192.13
g-ns-13.awsdns-12.com.  172800  IN      AAAA    2600:9000:5300:d00::1
g-ns-1740.awsdns-12.com. 172800 IN      A       205.251.198.204
g-ns-1740.awsdns-12.com. 172800 IN      AAAA    2600:9000:5306:cc00::1
g-ns-588.awsdns-12.com. 172800  IN      A       205.251.194.76
g-ns-588.awsdns-12.com. 172800  IN      AAAA    2600:9000:5302:4c00::1

;; Query time: 13 msec
;; SERVER: 205.251.192.13#53(205.251.192.13)
;; WHEN: Sat Apr 02 23:24:41 BST 2022
;; MSG SIZE  rcvd: 333

We’ve got an IP address for ns-98.awsdns-12.com.! Now we can answer our original question:

$ dig memo.barrucadu.co.uk @205.251.192.98
; <<>> DiG 9.16.27 <<>> memo.barrucadu.co.uk @205.251.192.98
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26684
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;memo.barrucadu.co.uk.          IN      A

;; ANSWER SECTION:
memo.barrucadu.co.uk.   300     IN      CNAME   barrucadu.co.uk.
barrucadu.co.uk.        300     IN      A       116.203.34.201

;; AUTHORITY SECTION:
barrucadu.co.uk.        172800  IN      NS      ns-1520.awsdns-62.org.
barrucadu.co.uk.        172800  IN      NS      ns-1828.awsdns-36.co.uk.
barrucadu.co.uk.        172800  IN      NS      ns-763.awsdns-31.net.
barrucadu.co.uk.        172800  IN      NS      ns-98.awsdns-12.com.

;; Query time: 13 msec
;; SERVER: 205.251.192.98#53(205.251.192.98)
;; WHEN: Sat Apr 02 23:25:37 BST 2022
;; MSG SIZE  rcvd: 213

And we’re done, after 6 requests to other nameservers. And in a real nameserver implementation, we’d be checking before each of those requests whether we already had the answer cached, so likely some of them (e.g. the request to find the com. nameservers) wouldn’t have been needed.
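A cache like that can be very simple. Here is a minimal Python sketch, assuming records are keyed by name and type and expire purely by TTL; the class and its injectable clock are invented for the example, and a real cache would also want case-insensitive names and a size limit.

```python
import time


class RecordCache:
    """A tiny TTL-based cache keyed by (name, record type)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so tests can control time
        self._entries = {}    # (name, rtype) -> (expires_at, value)

    def put(self, name, rtype, value, ttl):
        if ttl <= 0:
            return  # a TTL of zero means "use it now, but never cache it"
        self._entries[(name, rtype)] = (self._clock() + ttl, value)

    def get(self, name, rtype):
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(name, rtype)]  # expired: forget it
            return None
        return value
```

With the 172800 second TTL on the com. NS records, a resolver using something like this would skip the trip to the root servers for any .com name for the next two days.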

Zones?

In the previous section, it looked very much like the DNS was broken up into subtrees (or “zones”, if you will) based on the label structure:

The . nameservers knew about the com. and uk. nameservers, but couldn’t answer queries about subdomains of those directly
Similarly, the uk. nameservers knew about the nameservers for barrucadu.co.uk., but not any of its other records
And likewise for the com. nameservers and awsdns-12.com.

This makes sense. Imagine if the root nameservers knew every DNS record! Their databases would be huge! It would be infeasible to run a handful of servers which know hundreds of millions of records and which the whole world uses.

So . is a zone. And uk. is a zone. And barrucadu.co.uk. is a zone. All of the TLDs are zones, and every domain you can buy creates a new zone. A zone can be bigger than a single label, e.g. foo.bar.baz.barrucadu.co.uk. is in the barrucadu.co.uk. zone unless I delegate it to someone else, by creating some NS records for, say, baz.barrucadu.co.uk.

That’s exactly how registering a domain name works, by the way. The registrars have privileged access to the TLD nameservers, and you pay them some money for them to send a message to the nameservers saying “please delegate barrucadu to these other nameservers”.
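To make the "which zone is this name in?" rule concrete, here is a minimal sketch in Rust. The names `labels` and `closest_zone`, and the choice to represent domain names as dotted strings, are assumptions made for illustration; this is not how resolved does it.

```rust
/// Split an absolute domain name into lowercase labels.  The root,
/// ".", has no labels.
fn labels(name: &str) -> Vec<String> {
    name.split('.')
        .filter(|label| !label.is_empty())
        .map(|label| label.to_ascii_lowercase())
        .collect()
}

/// Return the most specific zone apex which encloses `name`, if any.
///
/// A zone encloses a name if the zone's labels are a suffix of the
/// name's labels.  Comparing whole labels (not raw strings) means
/// "notuk." is not mistaken for a subdomain of "uk.".
fn closest_zone<'a>(name: &str, zone_apexes: &[&'a str]) -> Option<&'a str> {
    let name_labels = labels(name);
    zone_apexes
        .iter()
        .copied()
        .filter(|apex| name_labels.ends_with(&labels(apex)))
        .max_by_key(|apex| labels(apex).len())
}
```

With apexes `.`, `uk.`, and `barrucadu.co.uk.`, the name `foo.bar.baz.barrucadu.co.uk.` lands in the `barrucadu.co.uk.` zone. Add `baz.barrucadu.co.uk.` to the list (the delegation case) and it lands there instead.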

Zones are traditionally represented in a textual format defined in RFC 1035.¹¹ You’ve seen this format before: it’s the format dig gives its responses in and it’s the format of the root hints file (and the root zone file, also provided by IANA).

Here’s the zone file I use for my LAN DNS:

$ORIGIN lan.

@ 300 IN SOA @ @ 4 300 300 300 300

router         300 IN A     10.0.0.1

nyarlathotep   300 IN A     10.0.0.3
*.nyarlathotep 300 IN CNAME nyarlathotep

help           300 IN CNAME nyarlathotep
*.help         300 IN CNAME help

nas            300 IN CNAME nyarlathotep

It’s a list of records, but note that they all use relative domain names (no dot at the end). I could write them as absolute domain names, but that would be repetitive, and who doesn’t want to golf their zone files? The $ORIGIN line at the top is used to complete any relative names, and the @ is an alias for the origin, so this zone file could also be written as:

lan. 300 IN SOA lan. lan. 4 300 300 300 300

router.lan.         300 IN A     10.0.0.1

nyarlathotep.lan.   300 IN A     10.0.0.3
*.nyarlathotep.lan. 300 IN CNAME nyarlathotep.lan.

help.lan.           300 IN CNAME nyarlathotep.lan.
*.help.lan.         300 IN CNAME help.lan.

nas.lan.            300 IN CNAME nyarlathotep.lan.
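The expansion rule is small enough to sketch in Rust. The function name `absolutise` is hypothetical, and this sketch ignores the escaping rules of the real zone file format (such as `\.` inside a label), so treat it as an illustration of the `$ORIGIN` and `@` behaviour only.

```rust
/// Expand a name as written in a zone file into an absolute domain
/// name, given the current origin (which must itself be absolute).
fn absolutise(name: &str, origin: &str) -> String {
    if name == "@" {
        // `@` is an alias for the origin itself.
        origin.to_string()
    } else if name.ends_with('.') {
        // Already absolute: leave it alone.
        name.to_string()
    } else if origin == "." {
        // Relative to the root: just add the trailing dot.
        format!("{}.", name)
    } else {
        // Relative: append the origin.
        format!("{}.{}", name, origin)
    }
}
```

So `absolutise("*.nyarlathotep", "lan.")` gives `*.nyarlathotep.lan.`, matching the expanded zone file above.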

Zones come in two types: authoritative (also just called a zone, or a master zone) and non-authoritative (also called hints). An authoritative zone has a SOA record, and causes the nameserver to give authoritative responses to questions which fall into that zone.¹²

Non-authoritative zones don’t, and are primarily useful as a sort of permanent cache. Take the root hints file for example: all recursive resolvers need to know the NS records for .. But they should not act as if they’re authoritative for ., they just know a little bit about it.

Since any nameserver could claim to be authoritative for any zone it wants, and I’m sure malicious nameservers often do try to claim ownership of big sites like google.com., how does the DNS work?

It works on trust.

You trust that the root nameservers will give you the correct nameservers for all the TLDs. You then, in turn, trust that the TLD nameservers will give the correct nameservers for the domains registered under those TLDs. And so on, all the way down to the domain you actually want to resolve.

Not every nameserver operator will be equally trustworthy or competent, so that trust does erode somewhat as you move further and further away from the root, but if you do some basic validation of DNS responses (e.g. rejecting a response with NS records for a domain which is not a subdomain of the zone which you know this nameserver to be authoritative for), you can do pretty well.
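As a rough sketch of that validation step, in Rust: the function names and the `(owner, nameserver)` pair representation here are assumptions for illustration, and a real resolver would apply checks like this to every section of a response, not just to NS records.

```rust
/// Split an absolute domain name into lowercase labels.
fn labels(name: &str) -> Vec<String> {
    name.split('.')
        .filter(|label| !label.is_empty())
        .map(|label| label.to_ascii_lowercase())
        .collect()
}

/// Is `name` equal to, or a subdomain of, `zone`?
fn is_in_zone(name: &str, zone: &str) -> bool {
    labels(name).ends_with(&labels(zone))
}

/// Given the zone we believe a nameserver is authoritative for, and
/// the `(owner name, nameserver name)` pairs from the NS records it
/// sent us, keep only the records that server could plausibly speak
/// for.
fn plausible_ns_records<'a>(
    server_zone: &str,
    ns_records: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    ns_records
        .iter()
        .copied()
        .filter(|(owner, _)| is_in_zone(owner, server_zone))
        .collect()
}
```

If a `uk.` nameserver hands back NS records for `barrucadu.co.uk.` and also for `google.com.`, only the first survives: the `uk.` servers have no business telling us who runs `google.com.`.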

Types of nameserver

There are, broadly speaking, three sorts of nameservers you see:

Authoritative nameservers are the source of truth for records about a given zone. Typically, these refuse to answer questions for other zones. These set the AA flag for queries falling into their zones and return a “name error” response if a name they are authoritative for doesn’t exist.¹³

In resolved this is implemented by the dns_resolver::nonrecursive module.
Recursive nameservers (or recursive resolvers) perform recursive resolution for anyone who wants it. For example: 1.1.1.1, 8.8.8.8, and the nameserver your ISP operates. Typically, these are not authoritative for any zones. Recursive nameservers are convenient because the client doesn’t need to implement the recursive algorithm themselves: they can just fire off a query and get the response.¹⁴

In resolved this is implemented by the dns_resolver::recursive module.
Forwarding nameservers (or forwarding resolvers) forward all queries to a recursive resolver, rather than do the recursive resolution themselves. Typically, these are not authoritative for any zones. Forwarding nameservers are simpler than recursive nameservers, and they’re useful for the same reason any other sort of proxy is: they can increase cache hit rate (by having many clients go through the forwarding resolver), and selectively falsify or block records.¹⁵

In resolved this is implemented by the dns_resolver::forwarding module.
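The possible outcomes of asking an authoritative-only nameserver a question can be sketched as a Rust enum. Everything here (the `Answer` and `Zone` types, the string-keyed maps) is a simplified assumption for illustration: it ignores wildcards, CNAMEs, delegations, and names which exist only because they have descendants.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Answer {
    /// We're authoritative and have matching records.
    Records(Vec<String>),
    /// The name exists, but has no records of the requested type.
    NoData,
    /// The name doesn't exist in our zone: a "name error".
    NameError,
    /// Not our zone: an authoritative-only server won't answer.
    NotAuthoritative,
}

struct Zone {
    /// Absolute, lowercase apex, e.g. "lan."
    apex: String,
    /// Absolute name -> record type -> record data.
    records: HashMap<String, HashMap<String, Vec<String>>>,
}

impl Zone {
    fn lookup(&self, name: &str, rtype: &str) -> Answer {
        let in_zone = name == self.apex || name.ends_with(&format!(".{}", self.apex));
        if !in_zone {
            return Answer::NotAuthoritative;
        }
        match self.records.get(name) {
            None => Answer::NameError,
            Some(by_type) => match by_type.get(rtype) {
                Some(data) if !data.is_empty() => Answer::Records(data.clone()),
                _ => Answer::NoData,
            },
        }
    }
}

/// A one-record version of the LAN zone, for trying things out.
fn example_zone() -> Zone {
    let mut router = HashMap::new();
    router.insert("A".to_string(), vec!["10.0.0.1".to_string()]);
    let mut records = HashMap::new();
    records.insert("router.lan.".to_string(), router);
    Zone { apex: "lan.".to_string(), records }
}
```

Asking this zone for the `AAAA` records of `router.lan.` is a `NoData`, not a `NameError`: the name exists, it just has nothing of that type.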

Of course, there’s no reason a single nameserver can’t do all of those things at the same time!

Consider bind, the big-name nameserver. Check out its configuration documentation: it says any zone can be authoritative, forwarded, or hints, and the allow-recursion option configures whether recursive queries for zones the server doesn’t know about are allowed.

My resolved server by default supports authoritative zones and recursive resolution. It may not appear to support bind-style zone-specific forwarding, but you could implement that with a hints file containing NS records for the zone you want to forward, and there is a command-line flag to forward all recursive queries to some other server.

The reason you’d want to make a nameserver do only one sort of resolution is to make operation simpler. In particular, it’s good practice for internet-facing authoritative nameservers to only perform non-recursive resolution. Answering or rejecting queries based only on local data makes them have much more predictable performance.

DNS doesn’t “propagate”

When I first got into all this web development stuff, the common wisdom was that DNS changes took 24 to 48 hours to propagate. But having seen some details of the DNS protocol and how recursive resolution works, does that really make sense? Shouldn’t changes be visible as soon as the TTL of the old record expires? And shouldn’t new records be visible immediately? Why do changes need to propagate? Where do they propagate to?

Propagation implies a push model, where you make your changes and then they get sent to the resolvers which need them. But that’s not what happens at all: instead, caches expire.

Ok, there are two cases in which DNS does propagate:

If you update your domain’s NS records, your registrar needs to push those changes to the TLD nameservers. Apparently this used to be kind of slow, like, 20+ years ago. These days it’s very fast.
If you run a very high traffic authoritative nameserver, you’ll operate multiple instances of it around the world to improve reliability and latency. So if you change a record, that change needs to be pushed out to all your servers. But this should take under a minute unless something is very wrong.

My hunch is that this 24 to 48 hour window came from:

Registrars being slow to update the TLD nameservers once upon a time
ISPs running notoriously poorly-behaving nameservers

Ah, ISP DNS. Almost the first thing any self-respecting nerd changes when setting up a new home network. They often do nefarious things like redirect misspelled domain names to ad-covered search pages, trying to profit off your typos. And, as it turns out, a lot of them ignore record TTLs, and will cache something for a long period if they feel like it.

How long? Well, I’ve seen reports of 24 hours…

Well, no matter what the cause of the occasional slow DNS update is—though I can’t say I’ve experienced slow DNS updates in a very long time, and updates are evidently fast enough for changing an A record to be considered a viable failover mechanism for big sites—“propagation” is the wrong mental model.

DNS is pull, not push.
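Here is a toy model of the pull behaviour in Rust. The `ToyCache` type is hypothetical, time is a plain count of seconds so the example is deterministic, and a `HashMap` stands in for the authoritative nameserver.

```rust
use std::collections::HashMap;

/// A toy resolver cache: name -> (address, expiry time in seconds).
struct ToyCache {
    entries: HashMap<String, (String, u64)>,
}

impl ToyCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Resolve `name` at time `now`.  `upstream` maps names to
    /// `(address, ttl)` and plays the part of the authoritative
    /// nameserver: it is only consulted when the cache has no live
    /// entry.
    fn resolve(
        &mut self,
        name: &str,
        now: u64,
        upstream: &HashMap<String, (String, u64)>,
    ) -> Option<String> {
        if let Some((address, expires)) = self.entries.get(name) {
            if *expires > now {
                return Some(address.clone());
            }
        }
        let (address, ttl) = upstream.get(name)?.clone();
        self.entries
            .insert(name.to_string(), (address.clone(), now + ttl));
        Some(address)
    }
}
```

Nothing ever tells the cache that the upstream record changed. It keeps serving the old address until the TTL runs out, then asks again and gets the new one.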

Are RFCs 1034 and 1035 enough?

I’ve been running resolved for my LAN for about two and a half weeks now. And it’s working pretty well! Ok, I have implemented a few more RFCs:

RFC 2782, which defines the SRV record type, because Minecraft can use SRV records to detect the correct port number of a server (but that’s totally optional, you can also just type in the port in the game client).
RFC 3596, which defines the AAAA¹⁶ type, because I wanted to be able to read the official, and unchanged, root hints file. But I don’t have IPv6 at home so I could make do without this.
RFC 6761, which defines some zones with special behaviour, which I distribute as zone files. This was actually the motivation for me to implement authoritative zones; previously I was only going to support hints and Pi-hole-like ad-blocking through hosts files.

So there are a few things. But what I’ve covered in this memo is, more or less, enough to implement a working nameserver. You’d need to look up the formats of a few more common record types in RFC 1035, and also the full algorithm for non-recursive resolution in RFC 1034 (which I glossed over in a single sentence), but the point is that DNS is not very complicated, even today.

There have been new record types; there have been security extensions; there have been clarifications; some zones have been given special meaning. But all of that is optional.

Certainly for a home network, RFCs 1034 and 1035 are enough.

¹ You could even call it a “NoSQL” database, if you really must.
² The +noedns flag turns off some extensions to the basic DNS protocol, which I’m not covering for simplicity.
³ Ok, I’ve actually known this one for a while, because I’m the sort of person to pedantically bring that up.
⁴ Well, this plus source port matching. There are also some other security mechanisms DNS clients sometimes use to prevent spoofed responses, like randomly capitalising letters in the question names (since DNS is case-insensitive), and checking that the response from the server uses the same random capitalisation.
⁵ Fun fact, Alpine Linux doesn’t support DNS over TCP, so it can break if a truncated response doesn’t include enough complete records for it to make progress.
⁶ And also makes encoded domain names work as null-terminated strings in C in the (very common) case where none of the labels contain a null byte. What a fortuitous coincidence!
⁷ It feels kind of wasteful that we effectively throw away 16 whole bits for each question and record on this historical artefact. UDP messages are short, so we compress domain names to squeeze out a little extra space, but then we waste a bunch like this! Even worse, there never were very many network classes: RFC 1035 only defines four. Did the IETF really expect there to be so many non-internet networks in the future?
⁸ Unless the query was for, say, IN CNAME memo.barrucadu.co.uk.. More on this in how resolution happens.
⁹ That step 1 is also doing a surprising amount of work if your nameserver supports authoritative zones (see next section). For the full gory details, see section 4.3.2 of RFC 1034.
¹⁰ I know it’s a necessary consequence of how DNS works, but I still find it pretty cool that there are servers which know about literally every com. (or uk., or net., etc) domain name.
¹¹ Like the DNS protocol, this format appears to be straightforward but is annoyingly fiddly when you come to implement it. It’s almost (but not quite!) line-oriented, just about every field is optional, and there are two fields which can be written in either order. Just why?
¹² See the next section for more on authoritative nameservers.
¹³ Note that there’s a difference between a domain not existing and a domain existing but having no records at all (or just no records matching the current query). An authoritative nameserver should only return a name error if it actually doesn’t exist.
¹⁴ In fact, the resolver your operating system uses is probably what’s called a “stub resolver”, rather than a recursive resolver. Try configuring your DNS resolver in /etc/resolv.conf to be one of the root nameservers, rather than a recursive resolver: it won’t work.
¹⁵ The Pi-hole is a forwarding resolver which blocks advertising domains by returning a fake A record pointing to some unusable IP address, like 0.0.0.0.
¹⁶ I used to read this as “A-A-A-A” but, having now typed and said it a bunch, I’ve switched to the less tongue-twistery “quad-A”. I wonder what actual networking people say.

Implementing a size-bounded LRU cache with expiring entries for my DNS server (in Rust)

2022-03-07T00:00:00Z

I’ve spent the last week or so implementing a recursive DNS resolver in Rust. I’m not very good at either of those things, so this has been a bit of a learning experience.

This memo is about how I ended up implementing the caching layer. You don’t need to know much about DNS to follow this memo, just some basics:

DNS is a distributed eventually-consistent database, and timeouts are how it achieves that eventual consistency.
The keys in this database are domain names (like www.barrucadu.co.uk) and the values are resource records (RRs for short).
A resource record has a type (like “A”, or “CNAME”), a class (like “IN”, for INternet), a ttl (how long it’s valid for), and some data.
Finally, the format of that data depends on the type (but not on the class).

Let’s make this a bit more concrete:

// we'll need these later
use priority_queue::PriorityQueue;
use std::cmp::Reverse;
use std::collections::HashMap;
use std::net::Ipv4Addr;
use std::time::{Duration, Instant};

/// A resource record, or RR, is something we receive from another
/// nameserver, or which we send in answer to a client's query.
#[derive(Debug)]
pub struct ResourceRecord {
    pub name: DomainName,
    pub rtype: RecordTypeWithData,
    pub rclass: RecordClass,
    pub ttl: Duration,
}

/// A domain name is a sequence of "labels", eg, `www.barrucadu.co.uk`
/// is made up of the labels `["www", "barrucadu", "co", "uk", ""]`.
/// The final empty label is the root domain, which we normally don't
/// bother writing, but is meaningful in some contexts.
///
/// Incidentally, the final empty label means that in the DNS wire
/// format, names are null-terminated.  I'm sure this isn't a
/// coincidence.
///
/// Labels are ASCII and case-insensitive, so make sure to construct
/// them correctly!
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DomainName {
    pub labels: Vec<Vec<u8>>,
}

/// Record data depends on its type, so this enum has one variant for
/// each type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RecordTypeWithData {
    A { address: Ipv4Addr },
    CNAME { cname: DomainName }, // many more omitted
}

/// We'll also need a notion of record type *without* data.
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub enum RecordType {
    A,
    CNAME, // many more omitted
}

impl RecordTypeWithData {
    pub fn rtype(&self) -> RecordType {
        match self {
            RecordTypeWithData::A { .. } => RecordType::A,
            RecordTypeWithData::CNAME { .. } => RecordType::CNAME,
            // many more omitted
        }
    }
}

/// The record class identifies which sort of network the record is
/// for.  For the purposes of this memo, let's only consider the
/// internet.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum RecordClass {
    IN,
}

Before we go any further, there’s one final prerequisite. When you ask a DNS server for some records, you don’t say,

Give me all records of such-and-such record type and record class for www.barrucadu.co.uk.

You instead ask in terms of a query type and query class.

In this memo, you can think of those as just the record types and classes we’ve just defined, plus a wildcard to mean “match anything”:

#[derive(Debug, Copy, Clone)]
pub enum QueryType {
    Record(RecordType),
    Wildcard,
}

// does a record match a query, or a query match a record?  this is
// the way 'round I went for, but the other choice would make just as
// much sense.
impl RecordType {
    pub fn matches(&self, qtype: &QueryType) -> bool {
        match qtype {
            QueryType::Wildcard => true,
            QueryType::Record(rtype) => rtype == self,
        }
    }
}

#[derive(Debug, Copy, Clone)]
pub enum QueryClass {
    Record(RecordClass),
    Wildcard,
}

impl RecordClass {
    pub fn matches(&self, qclass: &QueryClass) -> bool {
        match qclass {
            QueryClass::Wildcard => true,
            QueryClass::Record(rclass) => rclass == self,
        }
    }
}

There are a few more in reality, but they’re not important for our purposes.

So we’ll use the Record* types to put values into the cache and the Query* types to get values from the cache.

A Simple Cache

Right, what’s the simplest possible cache we could implement?

Perhaps something like this:¹

pub struct SimpleCache {
    entries: HashMap<DomainName, Vec<(RecordTypeWithData, RecordClass, Instant)>>,
}

impl SimpleCache {
    pub fn new() -> Self {
        Self {
            entries: HashMap::new(),
        }
    }

To put something in the cache, you just add it to the appropriate Vec:

    pub fn insert(&mut self, name: &DomainName, rr: ResourceRecord) {
        let entry = (rr.rtype, rr.rclass, Instant::now() + rr.ttl);
        if let Some(entries) = self.entries.get_mut(name) {
            entries.push(entry);
        } else {
            self.entries.insert(name.clone(), vec![entry]);
        }
    }

What if the user inserts the same record twice?

Well, what about it? This is a proof-of-concept! The DNS resolver will return duplicate records I guess! Moving swiftly on…

To get something from the cache, just iterate over the appropriate Vec, pulling out all the records with the right type and class:

    pub fn get(
        &self,
        name: &DomainName,
        qtype: QueryType,
        qclass: QueryClass,
    ) -> Vec<ResourceRecord> {
        let now = Instant::now();
        if let Some(entries) = self.entries.get(name) {
            let mut rrs = Vec::with_capacity(entries.len());
            for (rtype_with_data, rclass, expires) in entries {
                if rtype_with_data.rtype().matches(&qtype) && rclass.matches(&qclass) {
                    rrs.push(ResourceRecord {
                        name: name.clone(),
                        rtype: rtype_with_data.clone(),
                        rclass: *rclass,
                        ttl: expires.saturating_duration_since(now),
                    });
                }
            }
            rrs
        } else {
            Vec::new()
        }
    }
}

What if a record has expired?

Proof-of-concept! The caller can deal with that by checking expiration times or something!
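For what it's worth, the check the caller would have to do is small. Here's a hedged sketch with a cut-down record type (`CachedRecord` and `live_records` are made-up names for this example, not part of the cache above):

```rust
use std::time::{Duration, Instant};

/// A cut-down cached record: some data, and when it stops being valid.
#[derive(Debug, Clone, PartialEq)]
struct CachedRecord {
    data: String,
    expires: Instant,
}

/// Return the data and remaining TTL of every record still live at
/// `now`, dropping the ones which have expired.
fn live_records(records: &[CachedRecord], now: Instant) -> Vec<(String, Duration)> {
    records
        .iter()
        .filter(|record| record.expires > now)
        .map(|record| (record.data.clone(), record.expires.duration_since(now)))
        .collect()
}
```

A record whose expiry time is exactly `now` counts as expired here, which matches treating a TTL of 0 as "don't use this".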

So, this was the caching implementation I started with. It works, but it has some problems:

There’s no deduplication.
There’s no expiration.
There’s no limit on the number of records.
All domains get put in one HashMap, totally ignoring their hierarchical label structure.
Getting records of one type involves iterating through records of another type.

But it’s better than no cache!

A Better Cache

Ok, how do we do better? The most egregious problems with the simple cache are the duplicate entries and the unbounded growth.

Using something that takes the hierarchical structure of domain names into account, like a trie, would also be nice, but I’m not dealing with enough live cache entries for that to be a concern yet.

So, how do we remove entries?

Well, we could periodically iterate over the entire cache, removing all expired entries. But if entries have a long expiration time, or just get accessed frequently enough, they won’t expire. So relying on expiration isn’t enough, we also need to occasionally remove live entries.

This sounds like a job for an LRU² cache: a size-bounded LRU cache with expiring entries for my DNS server!

Before jumping straight to the struct definition, let’s think about how to model this:

To solve the problem of iterating through records of unrelated types, we’ll need to subdivide the entries by type as well as domain name.
We’ll need to keep track of the most recent time each record has been accessed, so when the cache is full of unexpired records we can work out which one to evict first.
But the cache may be big. There could be hundreds or thousands of domains in there, each likely with multiple records. Iterating through the whole thing to find records to evict is a bad choice. We need a more efficient data structure to map from eviction priority to domain name.
For similar reasons, we don’t want to have to iterate through the entire cache to work out how big it is.

My usual mantra for designing data structures is to “make illegal states unrepresentable”, but I don’t think that will work here. To make this cache efficient, we’ll need to denormalise the data, and make our code ensure the correct invariants hold. Testing helps with this (and indeed testing did find some bugs in my implementation).

So I decided to use a pair of priority queues³ to efficiently track (1) which domain is next to have an expiring record, and (2) which domain has been least recently used. I also decided to keep track of sizes and times throughout the data structure, rather than just in the records.
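If the `Reverse` wrapper used for the priorities is unfamiliar, this small sketch shows what it's for. It uses the standard library's `BinaryHeap` as a stand-in for the priority_queue crate (which additionally lets you change the priority of an item already in the queue), and plain seconds instead of `Instant`s; the function name `expiry_order` is made up for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Given `(domain, expiry time in seconds)` pairs, return the domains
/// in the order a "next to expire" queue would hand them out.
fn expiry_order(entries: &[(&str, u64)]) -> Vec<String> {
    // `BinaryHeap` is a max-heap: `pop` returns the greatest item.
    // Wrapping the expiry time in `Reverse` flips the ordering, so
    // the *smallest* (soonest) expiry time comes out first.
    let mut heap = BinaryHeap::new();
    for (name, expires_at) in entries {
        heap.push((Reverse(*expires_at), name.to_string()));
    }

    let mut order = Vec::with_capacity(entries.len());
    while let Some((Reverse(_), name)) = heap.pop() {
        order.push(name);
    }
    order
}
```

The same trick gives LRU order: use the last access time as the priority, and the least recently used domain is the first one popped.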

Here’s the new cache data structure:

#[derive(Debug)]
pub struct BetterCache {
    /// Cached records, indexed by domain name.
    entries: HashMap<DomainName, CachedDomainRecords>,

    /// Priority queue of domain names ordered by access times.
    ///
    /// When the cache is full and there are no expired records to
    /// prune, domains will instead be pruned in LRU order.
    ///
    /// INVARIANT: the domains in here are exactly the domains in
    /// `entries`.
    access_priority: PriorityQueue<DomainName, Reverse<Instant>>,

    /// Priority queue of domain names ordered by expiry time.
    ///
    /// When the cache is pruned, expired records are removed first.
    ///
    /// INVARIANT: the domains in here are exactly the domains in
    /// `entries`.
    expiry_priority: PriorityQueue<DomainName, Reverse<Instant>>,

    /// The number of records in the cache.
    ///
    /// INVARIANT: this is the sum of the `size` fields of the
    /// entries.
    current_size: usize,

    /// The desired maximum number of records in the cache.
    desired_size: usize,
}

#[derive(Debug)]
struct CachedDomainRecords {
    /// The time this record was last read at.
    last_read: Instant,

    /// When the next RR expires.
    ///
    /// INVARIANT: this is the minimum of the expiry times of the RRs.
    next_expiry: Instant,

    /// How many records there are.
    ///
    /// INVARIANT: this is the sum of the vector lengths in `records`.
    size: usize,

    /// The records, further divided by record type.
    ///
    /// INVARIANT: the `RecordType` and `RecordTypeWithData` match.
    records: HashMap<RecordType, Vec<(RecordTypeWithData, RecordClass, Instant)>>,
}

impl BetterCache {
    pub fn new() -> Self {
        Self::with_desired_size(512)
    }

    pub fn with_desired_size(desired_size: usize) -> Self {
        if desired_size == 0 {
            panic!("cannot create a zero-size cache");
        }

        Self {
            // `desired_size / 2` is a compromise: most domains will
            // have more than one record, so `desired_size` would be
            // too big for the `entries`.
            entries: HashMap::with_capacity(desired_size / 2),
            access_priority: PriorityQueue::with_capacity(desired_size),
            expiry_priority: PriorityQueue::with_capacity(desired_size),
            current_size: 0,
            desired_size,
        }
    }

There are some invariants there in the comments. I’d prefer not to have those, but I don’t think there’s any getting around it given that we want better than linear time eviction.

This is substantially more complex than the SimpleCache, and the operations we’re about to define on it are too. Make sure this all makes sense before continuing. In particular, you might notice that I’ve opted to have the LRU eviction expire entire domain names, rather than individual records within them.

Let’s go through the new operations in order of complexity: querying, eviction, and insertion.

Getting things out

This isn’t too bad:

    /// Get an entry from the cache.
    ///
    /// The TTL in the returned `ResourceRecord` is relative to the
    /// current time - not when the record was inserted into the
    /// cache.
    ///
    /// This entry may have expired: if so, the TTL will be 0.
    /// Consumers MUST check this before using the record!
    pub fn get(
        &mut self,
        name: &DomainName,
        qtype: &QueryType,
        qclass: &QueryClass,
    ) -> Vec<ResourceRecord> {
        if let Some(entry) = self.entries.get_mut(name) {
            let now = Instant::now();
            let mut rrs = Vec::new();
            match qtype {
                QueryType::Wildcard => {
                    for tuples in entry.records.values() {
                        to_rrs(name, qclass, now, tuples, &mut rrs);
                    }
                }
                QueryType::Record(rtype) => {
                    if let Some(tuples) = entry.records.get(rtype) {
                        to_rrs(name, qclass, now, tuples, &mut rrs);
                    }
                }
            }
            if !rrs.is_empty() {
                entry.last_read = now;
                self.access_priority
                    .change_priority(name, Reverse(entry.last_read));
            }
            rrs
        } else {
            Vec::new()
        }
    }
}

This is quite similar to what we had before. Sure, the extra layer of indirection adds a tad more complication, and there’s now a write operation in here (updating last_read and access_priority, which takes log time), but other than that nothing complex.

The to_rrs function just exists to prevent some code duplication:

/// Helper for `get`: converts the cached record tuples into RRs.
fn to_rrs(
    name: &DomainName,
    qclass: &QueryClass,
    now: Instant,
    tuples: &[(RecordTypeWithData, RecordClass, Instant)],
    rrs: &mut Vec<ResourceRecord>,
) {
    for (rtype, rclass, expires) in tuples {
        if rclass.matches(qclass) {
            rrs.push(ResourceRecord {
                name: name.clone(),
                rtype: rtype.clone(),
                rclass: *rclass,
                ttl: expires.saturating_duration_since(now),
            });
        }
    }
}

If you’re following along at home, put that definition outside the impl BetterCache block.

Evicting things

Here are the three simplest functions in the entire impl:

    /// Delete all expired records, and then enough
    /// least-recently-used records to reduce the cache to the desired
    /// size.
    ///
    /// Returns the number of records deleted.
    pub fn prune(&mut self) -> usize {
        if self.current_size <= self.desired_size {
            return 0;
        }

        let mut pruned = self.remove_expired();

        while self.current_size > self.desired_size {
            pruned += self.remove_least_recently_used();
        }

        pruned
    }

    /// Helper for `prune`: deletes all records associated with the
    /// least recently used domain.
    ///
    /// Returns the number of records removed.
    fn remove_least_recently_used(&mut self) -> usize {
        if let Some((name, _)) = self.access_priority.pop() {
            self.expiry_priority.remove(&name);

            if let Some(entry) = self.entries.remove(&name) {
                let pruned = entry.size;
                self.current_size -= pruned;
                pruned
            } else {
                0
            }
        } else {
            0
        }
    }

    /// Delete all expired records.
    ///
    /// Returns the number of records deleted.
    pub fn remove_expired(&mut self) -> usize {
        let mut pruned = 0;

        loop {
            let before = pruned;
            pruned += self.remove_expired_step();
            if before == pruned {
                break;
            }
        }

        pruned
    }

So simple! So straightforward! If only all my code could be like this.

prune shrinks the cache to the desired size by removing the expired entries and then removing enough domains (in LRU order) to get below the target.
remove_least_recently_used pops an entry from the access_priority queue, removes it from the expiry_priority queue (which takes log time), and deletes it from the top-level entries map. It also updates the current_size, and returns the number of records it just deleted.
remove_expired is deceptively simple. It looks easy at first glance, but it’s calling this remove_expired_step function in a loop, until no more get removed.

Removing an entire domain is easy, but removing individual records from a domain is harder:

The size of the domain will change.
The next_expiry of the domain may change.
Those changes need to be reflected in the top-level current_size and expiry_priority fields.
But if it’s the last record in the domain we should remove that entirely.

Additionally, the queue gives us the domain name, and there may be one or more expiring records in it (or even zero, but that would be a bug).

With all that said, here’s the implementation:

    /// Helper for `remove_expired`: looks at the next-to-expire
    /// domain and cleans up expired records from it.  This may delete
    /// more than one record, and may even delete the whole domain.
    ///
    /// Returns the number of records removed.
    fn remove_expired_step(&mut self) -> usize {
        if let Some((name, Reverse(expiry))) = self.expiry_priority.pop() {
            let now = Instant::now();

            if expiry > now {
                self.expiry_priority.push(name, Reverse(expiry));
                return 0;
            }

            if let Some(entry) = self.entries.get_mut(&name) {
                let mut pruned = 0;

                let rtypes = entry.records.keys().cloned().collect::<Vec<RecordType>>();
                let mut next_expiry = None;
                for rtype in rtypes {
                    if let Some(tuples) = entry.records.get_mut(&rtype) {
                        let len = tuples.len();
                        tuples.retain(|(_, _, expiry)| expiry > &now);
                        pruned += len - tuples.len();
                        for (_, _, expiry) in tuples {
                            match next_expiry {
                                None => next_expiry = Some(*expiry),
                                Some(t) if *expiry < t => next_expiry = Some(*expiry),
                                _ => (),
                            }
                        }
                    }
                }

                entry.size -= pruned;

                if let Some(ne) = next_expiry {
                    entry.next_expiry = ne;
                    self.expiry_priority.push(name, Reverse(ne));
                } else {
                    self.entries.remove(&name);
                    self.access_priority.remove(&name);
                }

                self.current_size -= pruned;
                pruned
            } else {
                self.access_priority.remove(&name);
                0
            }
        } else {
            0
        }
    }

It’s pretty complex. We could describe it in pseudocode like so:

Pop the next expiring domain from the queue.
Check the current time:
- If the expiry time is in the future, put it back in the queue and return.
- Otherwise, get the cached records:
  - If there are no cached records, remove the domain from the access queue and return.
  - Otherwise:
    1. Iterate through all the records and check if each should expire:
      - If so, remove the record.
      - Otherwise, keep track of the soonest future expiry time seen.
    2. Check if this removed all the records:
      - If so, remove the domain from the cache.
      - Otherwise, put it back in the expiry queue with the new expiry time.
    3. Update the size fields.

In outline, fairly simple. In implementation, not fairly simple. Maybe someone better at Rust would be able to write this in a clearer way, but this is what I’ve got.

Incidentally, one of the bugs found by testing (by inserting randomly generated entries, pruning the expired ones, and checking the invariants) was that I had that entry.size -= pruned; inside the for rtype in rtypes, which means that if a domain had multiple records of different types expire at the same time, the size would be wrong.

Putting things in

Unfortunately, this is the most complex part. Adding a new entry to our cache involves a lot of work to maintain those invariants, especially if we also want to handle duplicate entries.

So before getting to the code, let’s think about what the behaviour should be.

1. If the domain name isn’t in the cache at all, we need to:
  - Insert a CachedDomainRecords containing just our new record.
  - Add the domain to the access_priority queue.
  - Add the domain to the expiry_priority queue.
2. If the domain name is in the cache but it has no records of this type, we need to:
  - Add the record to the existing domain.
  - Update the domain’s size and last_read.
  - Update the access_priority queue.
  - Update the domain’s next_expiry and the expiry_priority queue, if this new record expires sooner than the current soonest.
3. If the domain name is in the cache and it does have records of this type, we need to:
  - Check if there is a duplicate record, and if so:
    - Delete it.
    - Decrement the domain’s size and the current_size.
    - Update the domain’s next_expiry and the expiry_priority queue if the duplicate would have been the soonest record to expire.
  - Then, the same as in case (2).

Additionally, in all cases, we need to increment the current_size.
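The duplicate-handling step can be looked at in isolation. This is a minimal sketch, assuming a simplified tuple of (data, expiry) instead of the real (type, class, expiry) tuples; insert_or_replace is a hypothetical name used only for illustration.

```rust
use std::time::Instant;

/// Insert `(data, expiry)` into `tuples`, replacing any existing tuple with
/// equal `data`. Returns the expiry of the replaced duplicate, if there was one.
fn insert_or_replace<T: PartialEq>(
    tuples: &mut Vec<(T, Instant)>,
    data: T,
    expiry: Instant,
) -> Option<Instant> {
    // Find the duplicate first, so the search has finished before we mutate.
    let position = tuples.iter().position(|(existing, _)| *existing == data);
    // Order within the vector does not matter here, so `swap_remove` is fine.
    let duplicate_expires_at = position.map(|i| tuples.swap_remove(i).1);
    tuples.push((data, expiry));
    duplicate_expires_at
}
```

The returned expiry is what a caller would need to decide whether the domain's soonest expiry time has to be recomputed.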

Got all that? Here’s the code:

    /// Insert an entry into the cache.
    pub fn insert(&mut self, record: &ResourceRecord) {
        let now = Instant::now();
        let rtype = record.rtype.rtype();
        let expiry = now + record.ttl;
        let tuple = (record.rtype.clone(), record.rclass, expiry);
        if let Some(entry) = self.entries.get_mut(&record.name) {
            if let Some(tuples) = entry.records.get_mut(&rtype) {
                let mut duplicate_expires_at = None;
                for i in 0..tuples.len() {
                    let t = &tuples[i];
                    if t.0 == tuple.0 && t.1 == tuple.1 {
                        duplicate_expires_at = Some(t.2);
                        tuples.swap_remove(i);
                        break;
                    }
                }

                tuples.push(tuple);

                if let Some(dup_expiry) = duplicate_expires_at {
                    entry.size -= 1;
                    self.current_size -= 1;

                    if dup_expiry == entry.next_expiry {
                        let mut new_next_expiry = expiry;
                        for (_, _, e) in tuples {
                            if *e < new_next_expiry {
                                new_next_expiry = *e;
                            }
                        }
                        entry.next_expiry = new_next_expiry;
                        self.expiry_priority
                            .change_priority(&record.name, Reverse(entry.next_expiry));
                    }
                }
            } else {
                entry.records.insert(rtype, vec![tuple]);
            }
            entry.last_read = now;
            entry.size += 1;
            self.access_priority
                .change_priority(&record.name, Reverse(entry.last_read));
            if expiry < entry.next_expiry {
                entry.next_expiry = expiry;
                self.expiry_priority
                    .change_priority(&record.name, Reverse(entry.next_expiry));
            }
        } else {
            let mut records = HashMap::new();
            records.insert(rtype, vec![tuple]);
            let entry = CachedDomainRecords {
                last_read: now,
                next_expiry: expiry,
                size: 1,
                records,
            };
            self.access_priority
                .push(record.name.clone(), Reverse(entry.last_read));
            self.expiry_priority
                .push(record.name.clone(), Reverse(entry.next_expiry));
            self.entries.insert(record.name.clone(), entry);
        }

        self.current_size += 1;
    }

I didn’t write this all in one go and get it right the first time. I first implemented this without the duplicate handling then, when it was working, I made it prevent duplicate records.

If you allow duplicates, the if let Some(tuples) block becomes much simpler:

if let Some(tuples) = entry.records.get_mut(&rtype) {
    tuples.push(tuple);
} else {
    entry.records.insert(rtype, vec![tuple]);
}

We’ve made it—the end of the operations!

Testing

This code is pretty involved, and I’ve already said that I made at least one mistake when first writing it. So how do I know it’s correct?

Tests.

Tests, tests, tests.

I’m not going to go into the actual test code (see the source if you want that), but I will outline the cases.

The most important thing is to have a good way to generate inputs: you want distinct domains, overlapping domains, distinct types, overlapping types, overlapping but unequal records… the whole shebang. I’m generating random records, rather than trying to enumerate all the useful cases. I’m a big fan of random inputs in testing in general.

Some say “oh, but if my test is randomised it’ll be flaky: it might pass some times and fail other times!” In which case… good? If your test fails, you’ve found a bug: fix it!
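If a randomised failure needs to be replayed, one common approach is to drive the generator from a seed and report that seed when a test fails. The sketch below is an assumption about how that could look, not a description of the tests in resolved: XorShift64 and random_inputs are hypothetical names, and the generated pairs stand in for real records.

```rust
/// A tiny xorshift64 generator: not cryptographically secure, but enough to
/// drive randomised tests without any external crates.
struct XorShift64(u64);

impl XorShift64 {
    /// Create a generator from a seed. Xorshift must not start from zero, so
    /// a zero seed is replaced with an arbitrary non-zero constant.
    fn new(seed: u64) -> Self {
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// A value in `0..bound` (slightly biased, which is fine for tests).
    /// `bound` must be non-zero.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Generate `n` hypothetical (domain index, TTL in seconds) pairs from a seed.
fn random_inputs(seed: u64, n: usize) -> Vec<(u64, u64)> {
    let mut rng = XorShift64::new(seed);
    // A small domain range makes overlapping domains likely.
    (0..n).map(|_| (rng.below(10), rng.below(300))).collect()
}
```

The same seed always yields the same inputs, so a failing run can be repeated exactly while the bug is being fixed.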

Anyway, here are my test cases:

Insert a record and then check I can retrieve it:
- With QueryType::Record(_) and QueryClass::Record(_)
- With QueryType::Wildcard and QueryClass::Record(_)
- With QueryType::Record(_) and QueryClass::Wildcard
- With QueryType::Wildcard and QueryClass::Wildcard
Insert the same record twice and check the current_size only goes up by 1, and that the invariants hold.
Insert 100 random records and check that the invariants hold.
Insert 100 random records, check that they can all be retrieved, and that the invariants hold.
Insert 100 random records into a cache with a desired_size of 25, call prune, and check that 25 records remain and that the invariants hold.
Insert 100 random records, 49 of which have a TTL of 0, call remove_expired, and check that 51 remain and that the invariants hold.
Insert 100 random records into a cache with a desired_size of 99, 49 of which have a TTL of 0, call prune, and check that 51 remain and that the invariants hold.

In most of those tests I check that the data structure invariants hold. To do that, I:

Check that the current_size is equal to the total number of records.
Check that the entries and the access_priority are the same size.
Check that the entries and the expiry_priority are the same size.
Check the next_expiry for each domain is equal to the minimum of its records’ expiry times.
Build a new access_priority from the domains and check it’s the same as the stored one.
Build a new expiry_priority from the domains and check it’s the same as the stored one.
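Two of these checks can be sketched against a stripped-down model of the cache. Everything here is an assumption made for illustration: ToyCache and ToyEntry are hypothetical types that keep only expiry times, and the priority-queue checks are left out.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical stand-in for the cache: record payloads are omitted and each
/// domain just holds the expiry times of its records.
struct ToyCache {
    entries: HashMap<String, ToyEntry>,
    current_size: usize,
}

struct ToyEntry {
    next_expiry: Instant,
    expiries: Vec<Instant>,
}

/// Check two invariants, returning a description of the first violation.
fn check_invariants(cache: &ToyCache) -> Result<(), String> {
    // current_size must equal the total number of records.
    let total: usize = cache.entries.values().map(|e| e.expiries.len()).sum();
    if total != cache.current_size {
        return Err(format!(
            "current_size is {} but there are {} records",
            cache.current_size, total
        ));
    }
    // next_expiry must be the minimum expiry time of the domain's records.
    // A domain with no records fails this check too, as `min` is then `None`.
    for (name, entry) in &cache.entries {
        if entry.expiries.iter().min() != Some(&entry.next_expiry) {
            return Err(format!("next_expiry for {} is not the minimum expiry", name));
        }
    }
    Ok(())
}
```

Returning a message rather than a bare bool makes a failing randomised test easier to diagnose.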

I feel pretty confident that my tests cover a variety of different cases and sequences of operations, and that I would have found any significant bugs. There could always be subtle bugs lurking, but that’s true of all code.

Periodic pruning

I’ve opted to prune the cache in two places.

Firstly, in my actual code, this cache is inside an Arc<Mutex<_>>, so it can be shared across threads. There’s not much point in having an unshared cache, after all. Anyway, this wrapper has some helper methods to get and insert entries, and the get helper calls remove_expired if it fetches any expired records:

impl SharedCache {
    pub fn get(
        &self,
        name: &DomainName,
        qtype: &QueryType,
        qclass: &QueryClass,
    ) -> Vec<ResourceRecord> {
        let mut rrs = self.get_without_checking_expiration(name, qtype, qclass);
        let len = rrs.len();
        rrs.retain(|rr| rr.ttl > Duration::ZERO);
        if rrs.len() != len {
            self.remove_expired();
        }
        rrs
    }

    // ... more omitted
}
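The general shape of such a wrapper, with a lookup that sweeps expired entries when it trips over one, can be sketched with the standard library alone. This is a toy under stated assumptions, not the real SharedCache: ToySharedCache is a hypothetical type that maps a name straight to an expiry time.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical toy cache mapping a name to the instant it expires.
/// Cloning it gives another handle to the same underlying map.
#[derive(Clone, Default)]
struct ToySharedCache {
    inner: Arc<Mutex<HashMap<String, Instant>>>,
}

impl ToySharedCache {
    fn insert(&self, name: &str, ttl: Duration) {
        let mut map = self.inner.lock().expect("cache mutex poisoned");
        map.insert(name.to_string(), Instant::now() + ttl);
    }

    /// Return the remaining TTL, or `None` if the entry is missing or expired.
    /// Finding an expired entry triggers a sweep of every expired entry.
    fn get(&self, name: &str) -> Option<Duration> {
        let now = Instant::now();
        let mut map = self.inner.lock().expect("cache mutex poisoned");
        match map.get(name).copied() {
            Some(expiry) if expiry > now => Some(expiry - now),
            Some(_) => {
                map.retain(|_, expiry| *expiry > now);
                None
            }
            None => None,
        }
    }

    fn len(&self) -> usize {
        self.inner.lock().expect("cache mutex poisoned").len()
    }
}
```

Sweeping only when a lookup actually sees something expired keeps the common path cheap, at the cost of an occasional slower lookup.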

Secondly, I spawn a tokio task to periodically remove expired entries, and then do additional pruning if need be:

async fn prune_cache_task(cache: SharedCache) {
    loop {
        sleep(Duration::from_secs(60 * 5)).await;

        let expired = cache.remove_expired();
        let pruned = cache.prune();

        println!(
            "[CACHE] expired {:?} and pruned {:?} entries",
            expired, pruned
        );
    }
}

It was very satisfying when I added this and first saw that [CACHE] output with non-zero expired and pruned records.
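For comparison, the same periodic pattern can be written without tokio, using a plain thread and a channel as the stop signal. This is a swapped-in technique shown as a sketch, not what resolved does; spawn_periodic is a hypothetical helper name.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Run `work` every `interval` on a background thread until a message is sent
/// on the returned sender (or the sender is dropped), then exit.
fn spawn_periodic<F>(interval: Duration, mut work: F) -> (Sender<()>, JoinHandle<()>)
where
    F: FnMut() + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // Nothing arrived within the interval: time to do the work.
            Err(RecvTimeoutError::Timeout) => work(),
            // A stop message arrived, or the sender was dropped.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    (stop_tx, handle)
}
```

Waiting on the channel doubles as the sleep, so the thread can be stopped promptly instead of finishing a full five-minute wait first.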

What Next?

This cache works, and it works well. I get nice and fast responses from my DNS server for queries which are wholly or partially cached, and the benchmarks I’ve written look promising:

insert/unique/1         time:   [1.0965 us 1.1001 us 1.1044 us]
                        thrpt:  [905.51 Kelem/s 909.00 Kelem/s 912.01 Kelem/s]
insert/unique/100       time:   [115.72 us 115.96 us 116.24 us]
                        thrpt:  [860.27 Kelem/s 862.39 Kelem/s 864.15 Kelem/s]
insert/unique/1000      time:   [1.1769 ms 1.1787 ms 1.1807 ms]
                        thrpt:  [846.96 Kelem/s 848.36 Kelem/s 849.67 Kelem/s]

insert/duplicate/1      time:   [1.1927 us 1.1964 us 1.2003 us]
                        thrpt:  [833.13 Kelem/s 835.86 Kelem/s 838.44 Kelem/s]
insert/duplicate/100    time:   [56.880 us 57.047 us 57.221 us]
                        thrpt:  [1.7476 Melem/s 1.7529 Melem/s 1.7581 Melem/s]
insert/duplicate/1000   time:   [541.33 us 542.10 us 542.93 us]
                        thrpt:  [1.8419 Melem/s 1.8447 Melem/s 1.8473 Melem/s]

get_without_checking_expiration/hit/1
                        time:   [1.4057 us 1.4249 us 1.4425 us]
                        thrpt:  [693.22 Kelem/s 701.81 Kelem/s 711.40 Kelem/s]
get_without_checking_expiration/hit/100
                        time:   [84.651 us 84.999 us 85.322 us]
                        thrpt:  [1.1720 Melem/s 1.1765 Melem/s 1.1813 Melem/s]
get_without_checking_expiration/hit/1000
                        time:   [991.64 us 997.89 us 1.0030 ms]
                        thrpt:  [996.98 Kelem/s 1.0021 Melem/s 1.0084 Melem/s]

get_without_checking_expiration/miss/1
                        time:   [948.17 ns 961.92 ns 974.39 ns]
                        thrpt:  [1.0263 Melem/s 1.0396 Melem/s 1.0547 Melem/s]
get_without_checking_expiration/miss/100
                        time:   [45.399 us 46.116 us 46.671 us]
                        thrpt:  [2.1426 Melem/s 2.1684 Melem/s 2.2027 Melem/s]
get_without_checking_expiration/miss/1000
                        time:   [570.42 us 577.92 us 583.75 us]
                        thrpt:  [1.7131 Melem/s 1.7303 Melem/s 1.7531 Melem/s]

remove_expired/1        time:   [1.2796 us 1.2983 us 1.3151 us]
                        thrpt:  [760.38 Kelem/s 770.26 Kelem/s 781.52 Kelem/s]
remove_expired/100      time:   [55.622 us 56.761 us 57.895 us]
                        thrpt:  [1.7273 Melem/s 1.7618 Melem/s 1.7978 Melem/s]
remove_expired/1000     time:   [786.47 us 794.30 us 800.89 us]
                        thrpt:  [1.2486 Melem/s 1.2590 Melem/s 1.2715 Melem/s]

prune/1                 time:   [1.3455 us 1.3539 us 1.3617 us]
                        thrpt:  [734.36 Kelem/s 738.63 Kelem/s 743.24 Kelem/s]
prune/100               time:   [41.584 us 41.676 us 41.774 us]
                        thrpt:  [2.3938 Melem/s 2.3995 Melem/s 2.4048 Melem/s]
prune/1000              time:   [613.73 us 617.63 us 620.87 us]
                        thrpt:  [1.6106 Melem/s 1.6191 Melem/s 1.6294 Melem/s]

But could it be better?

The only optimisation that really comes to mind is using a trie instead of the HashMap for domains. Another possibility is turning it into a more generic size-bounded-LRU-cache-with-expiration data structure with type parameters, and so making the DNS usage just a specialisation of that; perhaps genericising the code would make it easier to see improvements.

But nothing needs to be done: it works pretty well as it is. When I start using my DNS server for my LAN, and it starts to get much more traffic than my test instance, I’m sure performance problems will start to crop up, but hopefully they won’t be with this cache.

1. Not just “perhaps”: this is more-or-less copied straight from my original code.
2. Least Recently Used
3. From the priority-queue crate. I started out trying to build something on top of std::collections::BinaryHeap directly, but didn’t get very far.

Continuous Integration and Continuous Deployment

2021-03-20T00:00:00Z

Once upon a time I used a self-hosted instance of Jenkins and the free-for-open-source Travis CI for continuous integration (CI) and continuous deployment (CD). It worked, but had some undesirable traits:

There wasn’t any rhyme or reason over what ended up where.
Travis often took a long time to run jobs.
Jenkins was almost all hand-configured, with little config in version control.

I’m a big fan of configuration-as-code, and when I was exposed to Concourse CI at work, which does everything through configuration files and environment variables, I decided to replace my Jenkins set-up and migrate some of my Travis projects as a learning experience.

Eventually I ended up with Concourse doing continuous deployment, and Travis solely for continuous integration. This worked well, until the future of the free-for-open-source Travis became uncertain, and I decided to move away.

As luck would have it, we were discussing using GitHub Actions for CI at work at the time. I decided to switch to Actions as another learning experience.

Now I have GitHub Actions for CI on pull requests (PRs), and Concourse for CD of master branches. It works pretty well.

This memo talks through my practices, using this blog and dejafu as running examples. I’ll also cover how I run Concourse on NixOS, other related tools I use, and what my plans for future work are.

GitHub Actions

GitHub Actions is GitHub’s hosted CI/CD tool. It’s got good support for both official and community-maintained Actions (which are Docker images or JavaScript programs conforming to a simple specification), is as well-integrated into the rest of GitHub as you’d expect, and has a config file syntax not entirely unlike Travis.

Currently I’m inconsistent across my repos about whether I require Actions to pass before a commit can make it into master. I tend to have that for my Haskell packages, because master gets deployed to Hackage, but allow pushing straight to master for other things.

Example: memo.barrucadu.co.uk

See the configuration file.

This is fairly typical of my Python projects: I have two jobs, which show up as two separate checks with their own logs in a PR, one to check for linting errors and one to check that the dependencies all install.

I’ve found that pip doesn’t have the most robust dependency solver, and can sometimes get confused and install mutually incompatible versions of packages. So for any PR which upgrades the dependencies, I like to ensure that the freeze file has a consistent set of versions.

If I wrote tests they would solve this problem too. But I don’t.

Example: dejafu

See the configuration file.

This is rather more complicated. I want to build the code and run the tests against all the supported versions of GHC, but for linting and doctests I just want to use the latest version. And I want the linting, doctests, and each of the main tests to run as separate jobs. This makes them run in parallel, and means that a failure in one doesn’t prevent the rest from running.

Like Travis, GitHub Actions supports matrix builds. The strategy part of the configuration means “run this job with each of these options; and don’t kill the rest if one fails”:

strategy:
  fail-fast: false
  matrix:
    resolver:
      - lts-9.0 # ghc-8.0
      - lts-10.0 # ghc-8.2
      - lts-12.0 # ghc-8.4
      - lts-13.3 # ghc-8.6
      - lts-15.0 # ghc-8.8
      - lts-17.0 # ghc-8.10

Another nice feature of GitHub Actions is that the documentation is well-written and easy to follow. Just about every option has a short example.

Concourse CI

Concourse CI is an opinionated “continuous thing-doer”. Everything is containerised and pure. No state is shared between jobs without you explicitly managing it, in the form of a “resource” (like a git remote, or an S3 bucket).

This was a big change when I came from Jenkins, which is just about as impure as you can get, but I’ve become a big fan of it. It makes jobs (potentially) reproducible, as they only depend on their inputs and on the pipeline configuration. You can have nondeterminism in your configuration, but you can’t get into trouble because of a previous build leaving things in a weird state.

I currently have 16 Concourse pipelines deploying a variety of things:

My Haskell packages (by uploading a package to Hackage)
My bookdb and bookmarks (by uploading a Docker image to my registry, and SSHing into a server to restart a systemd unit)
A bunch of static websites
My AWS and DNS configuration (these jobs automatically plan, but don’t apply until I click a button)

Example: memo.barrucadu.co.uk

See the configuration file.

This is another fairly typical pipeline: all of my static websites look largely like this. The one unusual feature is that it builds a Docker image: I need a few dependencies to deploy this site, like pandoc, so rather than install them on every deploy I build an image.

The deploy uses a custom rsync-resource that I took from somewhere and slightly tweaked. It also uses ((secrets)) in a few places.

The configuration is rather more verbose than GitHub Actions. It is doing more, but it also requires more to be spelled out. This can make large pipelines a bit difficult to read.

Example: dejafu

See the configuration file.

This is significantly more complicated. dejafu is a monorepo containing four Haskell packages and one set of tests, so this pipeline has jobs for testing & releasing each of those packages, as well as a job to run a nightly build when Stackage updates.

I use YAML anchors to reduce the repetition, which helps a bit, but it’s still a pretty long file.

This pipeline shows off Concourse’s job dependencies. All builds are triggered by a “resource” changing, but a job can specify that it should only be called for resources which passed a previous job.

For example, the release-concurrency job will be triggered by changes to the concurrency-cabal-git resource, but only after they pass the test-concurrency job:

- name: test-concurrency
  plan:
    - get: concurrency-cabal-git
      trigger: true
    - task: build-and-test
      input_mapping:
        source-git: concurrency-cabal-git
      config:
        <<: *task-build-and-test

- name: release-concurrency
  plan:
    - get: concurrency-cabal-git
      trigger: true
      passed:
        - test-concurrency
    - task: prerelease-check
      params:
        PACKAGE: concurrency
      input_mapping:
        source-git: concurrency-cabal-git
      config:
        <<: *task-prerelease-check
    - task: release
      params:
        PACKAGE: concurrency
      input_mapping:
        source-git: concurrency-cabal-git
      config:
        <<: *task-release

These dependencies are what make up the visualisation in the screenshot above.

Other tools: Dependabot

Dependabot is a handy little tool for automatically checking if you have any outdated dependencies, for a variety of ecosystems, and opening a PR to update them. It’s another tool we use at work (spotting a pattern?), but I didn’t pick this up to learn anything: it’s so simple there’s nothing really to learn, and its utility far outweighs the small configuration file you might want to write.

Example: memo.barrucadu.co.uk

See the configuration file.

This is one of my more complex Dependabot config files, which should hopefully convince you of how straightforward it is. It specifies that I want PRs to update any official or community Actions, Dockerfile base images, or pip dependencies that I’m using, and that I want it to check daily (at 5AM UTC by default).

That’s it!

Example: dejafu

See the configuration file.

Unlike the other cases, this time dejafu has a simpler configuration than the blog. Dependabot doesn’t support Haskell, so all it’s doing is ensuring any Actions I’m using are kept up to date.

Since my Haskell packages are on Stackage, the Stackage maintainers let me know if I need to update a dependency.

Secrets Management

I don’t make a practice of needing secrets to build or run code in my public repos, so I don’t need to give GitHub Actions any secrets. It’s supported, though: you can have both organisation-level and repository-level secrets.

My Concourse pipelines, however, do regularly need secrets. The password for my private Docker registry; the password to upload Haskell packages to Hackage; the SSH key to deploy this blog; and more!

Concourse has support for a few secret stores. I’m using the AWS SSM integration, mostly because it’s incredibly cheap, and means I don’t have to host and secure anything myself. It works well: I just need to set some environment variables giving Concourse an AWS access key hooked up to an IP-restricted policy granting SSM and KMS permissions. Almost no effort at all to set up if you already have an AWS account.

Running Concourse CI on NixOS

NixOS is my Linux distribution of choice and, while it has packages for many things, it does not have one for Concourse. However, there is an official docker image for Concourse.

I’ve got a systemd unit running Concourse in docker-compose:

systemd.services.concourse =
  let
    yaml = import ./concourse.docker-compose.nix {
      httpPort = concourseHttpPort;
      githubClientId     = fileContents /etc/nixos/secrets/concourse-clientid.txt;
      githubClientSecret = fileContents /etc/nixos/secrets/concourse-clientsecret.txt;
      enableSSM = true;
      ssmAccessKey = fileContents /etc/nixos/secrets/concourse-ssm-access-key.txt;
      ssmSecretKey = fileContents /etc/nixos/secrets/concourse-ssm-secret-key.txt;
    };
    dockerComposeFile = pkgs.writeText "docker-compose.yml" yaml;
  in
    {
    enable = true;
    wantedBy = [ "multi-user.target" ];
    requires = [ "docker.service" ];
    environment = { COMPOSE_PROJECT_NAME = "concourse"; };
    serviceConfig = {
      ExecStart = "${pkgs.docker_compose}/bin/docker-compose -f '${dockerComposeFile}' up";
      ExecStop  = "${pkgs.docker_compose}/bin/docker-compose -f '${dockerComposeFile}' stop";
      Restart   = "always";
    };
  };

Where the concourse.docker-compose.nix file is just some templated YAML. I’ve heard that you shouldn’t use systemd units to run Docker containers, for some reason, but it works and I run a few different services on a bunch of servers like this. Running Concourse in Docker also makes it easy to upgrade to a newer version, without needing to wait for an official package to be updated.

Future Work

I’m pretty happy with how things are working right now. Until recently I didn’t have Concourse secrets set up, and I was handling secrets by doing variable interpolation in my pipeline deployment script, and also I’d written everything in jsonnet for some reason. Setting up secrets, just using YAML, and removing the deployment script simplified things a lot.

I see GitHub advertising code scanning to me in all of my repositories, so maybe I’ll look into that next. I’m a big fan of static analysis, so having something which automatically scans my code for issues is very attractive.

The main thing I don’t have continuous deployment for is my NixOS configuration. I SSH into servers and run git pull && sudo nixos-rebuild switch like some sort of caveman! But automatically deploying that makes me a bit nervous: what if it goes wrong? Still, I switched to automatic updates recently, and nothing has broken yet, so maybe automatic configuration deployments are fine too.

At home for one year

2021-03-19T00:00:00Z

The 19th of March, 2020 was the last time I visited the office, and there were only a couple of other people in.

Lockdowns have come and gone, restrictions have changed frequently and unexpectedly, and so I’ve lived the last 12 months as a hermit. Since that final day in the office, one year ago today, I’ve only left my flat once or twice a week, and that only to go shopping.

There is a vaccine now but, judging from the timeline, I’ll still be at home for a few more months.

The Good and the Bad

It feels a bit selfish to type this, but frankly I’ve been having a great time:

My sleep has improved. The lack of commute means I get an extra hour or so to lie in bed.
I’ve saved money. Partly due to the lack of commute, but also due to not going out to buy lunch. Even one or two lunches a week add up.
I’ve been reading more. I now have more energy in the evenings after the work day ends, so I’ve got back into the habit of reading before bed. And over 2020 I read 99 books.
I can cook whenever I want. I used to get hungry in the afternoons almost every day. One day a thought hit me: if I’m at home all the time now, I can cook a proper meal for lunch! And so I switched to having my main meal of the day for lunch, and a smaller meal in the evening.
I’ve started a second RPG group. I did get a bit bored after a couple of months, and so I reached out to some online friends to see if anyone wanted to play games. I’ve now got a group which has been going strong since May, and I’ve deepened those friendships.
I’m not in an open office any more. I don’t like open office layouts. I always feel like someone is peering over my shoulder and watching my screen. It’s not an issue with just my current job, it’s been an issue everywhere. At home, I know there is nobody watching me, and I feel much more relaxed, even when I’m not slacking off.

Of course, there have been a handful of downsides too:

I’ve not seen any friends. I’ve got a small group of friends who meet up a couple of times a year, and we’ve missed a few of those meetings. We’ve made do with Zoom calls, but it’s not the same.
I’ve not seen any family. I normally only visit home at Christmas, and Christmas got cancelled.
I came down with shingles. Not very fun, possibly caused by stress. I’ve got a few small scars on my forehead which, now that it’s been nearly 6 months, will likely not heal. However, other than that one week of illness, my health has been great.

But the upsides definitely outweigh these. I was already only physically meeting friends and family three or four times a year, so missing one year isn’t a huge change. It’s not like I’ve gone from hanging out with people at the pub every week to never seeing anyone.

…and the Strange

The weirdest part of the past year, by far, has been the discovery that a significant number of people just cannot cope with being alone, and break down after spending even a fortnight by themselves.

I was regularly spending weeks by myself even before covid!

It makes some sense though. I fill my time with reading, programming, playing RPGs, and socialising with online friends. Most people don’t do any of those to any significant degree (or at all). If everything you do for fun requires the physical presence of other people, the past year will have been tough.

I also have appropriate desk space, and don’t have noisy children or housemates. Being a loner with a nice flat during lockdown is life in easy mode.

I’m sure I will have to return to the office at some point, but I’ll fully enjoy being at home until then.

Quick Code Improvements

2021-03-05T00:00:00Z

Here are 21 small improvements you can make to your code or the tooling around it, taken from the Code Quality Challenge in February 2021. If you find yourself with 20 minutes spare, pick one and see how far you can get.

Improve your README

For example, document the philosophy behind your project and how it fits into the larger ecosystem; give a comparison to similar projects; give usage examples; explain how it’s developed and tested; how it’s deployed (if it’s a program); and your approach to outside contributions.

Nuke TODO comments

Grep for TODO and: if out of date, delete; if still relevant, fix or turn into an issue; and if you’re unsure, find someone who is sure.

Get rid of a warning

Whether it’s in the code proper or just in the tests, fix at least one.

Delete some unused code

Tools like unused or test coverage metrics can help you track down dead code.

Trim your (git) branches

Run git remote prune origin to delete any remote-tracking branches whose branch on the remote has since been deleted (for example, after being merged). If you have any old branches of your own, get rid of them with git push origin --delete <branch-name>.

Extract a compound conditional

Look for complex conditionals of multiple terms and see if they can be extracted into a function or a variable whose name clearly expresses what is being checked.

Slim down an overgrown class

Look at your largest classes (or modules if you’re using a class-less language) and see if there are any bits of code which can be refactored. Extract a new class (or module), delete a stray comment, improve a name, tighten the visibility of a method (or function, type, etc), split apart a long method, and so on.

Help new starters get up to speed

The actual challenge was to create a setup script, but you might have a different approach to solving this problem. So create a setup script, or a Dockerfile, add instructions to your README, or however you do it.

Run your tests with no network connection

Tests which rely on an external service are slow and brittle, so try to get your tests passing without any such dependencies.

Investigate your slowest tests

Find your 10 slowest tests or so and have a look through them. Are any duplicates? Can any be replaced by a faster variant? Are they actually useful?

Improve one name

Find one poorly-named thing and make it better. Any thing.

Audit your dependencies

Are they still needed? Is everything up to date? Can a runtime dependency be turned into a build or test dependency?

Audit your PRs and issues

Have any been hanging around for a while? If so, are they still relevant? If you’re not sure, ask the reporter if they can confirm, and close the issue if they don’t get back to you in a week or so.

Investigate long parameter lists

Long parameter lists, particularly if they occur in multiple methods (or functions), might indicate that there’s a useful type you’re missing, or that some of the parameters should be instance data. Some parameters, like booleans, may indicate that you’ve got one method doing the work of several, and it should be split up.

Automate something repetitive

Find something you do repeatedly and automate it. For example, write shell aliases for some commands you run a lot.

Audit your database schema

You might look for inconsistent column names, missing indices, or missing null or foreign key constraints.

RTFM

Look at the docs for something you use a lot—whether that’s a development tool (like your text editor), or a backing service (like a database), or a framework, or something else—and see if there’s anything which you can apply.

Investigate high-churn files

Files which change a lot can point to a good refactoring opportunity. With git you can see the number of commits each file has with:

git log --all -M -C --name-only --format='format:'  \
    | sort \
    | grep -v '^$' \
    | uniq -c \
    | sort -n \
    | awk 'BEGIN {print "count\tfile"} {print $1 "\t" $2}'

Create or update your snippets

If your text editor has support for snippets, make sure you have some for any code patterns you type a lot.

Begin plugging a knowledge gap

There’s probably something you know you don’t know. Start doing something about it: spend 20 minutes researching it and start to chip away at your lack of knowledge.

Extract a method

Look at your larger methods (or functions): are there any groups of functionality which could be pulled out into smaller units of code with their own clear names?
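To make the "extract a compound conditional" item above concrete, here is a small before-and-after sketch. The Order type, its fields, and the free-shipping rule are all invented for this illustration.

```rust
/// Hypothetical order type, invented purely for this illustration.
struct Order {
    total_pence: u64,
    is_member: bool,
    has_coupon: bool,
}

impl Order {
    /// Before, the call site read:
    /// `if order.total_pence >= 5_000 && (order.is_member || order.has_coupon)`.
    /// After, the condition has a name that says what is being checked.
    fn qualifies_for_free_shipping(&self) -> bool {
        let is_large_order = self.total_pence >= 5_000;
        let has_discount_status = self.is_member || self.has_coupon;
        is_large_order && has_discount_status
    }
}
```

The call site then becomes if order.qualifies_for_free_shipping(), and the rule lives in one place if it ever changes.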

Indoor Air Quality

2021-02-06T00:00:00Z

I strongly suspect my thermostat is lying to me.

Some days I will be shivering, and it says the temperature is 28C.

Some days I will be sweating, and it says the temperature is 18C.

It’s as if it’s measuring the temperature of somewhere else, but the thermostat is in my living room.

So to put the issue to rest, I wanted to get a smart ambient thermometer, to compare measurements. Ideally something with an API which I can use to get the data into Prometheus and graph it.

This is the Awair Element, an indoor air quality monitoring smart device, which measures a bunch of things—including temperature.¹ I’ve got one in my living room, and I plan to get one for my bedroom.

It has an API, so I can scrape the data:

$ curl http://10.0.20.117/air-data/latest | json_pp

{
   "abs_humid" : 9.06,
   "co2" : 799,
   "co2_est" : 693,
   "dew_point" : 10.05,
   "humid" : 49.91,
   "pm10_est" : 4,
   "pm25" : 3,
   "score" : 90,
   "temp" : 20.89,
   "timestamp" : "2021-02-06T20:39:13.338Z",
   "voc" : 422,
   "voc_baseline" : 2562694386,
   "voc_ethanol_raw" : 38,
   "voc_h2_raw" : 27
}
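Getting that payload into Prometheus needs something to turn it into the text exposition format. Here is a minimal sketch of that conversion, assuming an `awair_` metric prefix and treating every numeric field as a gauge; both choices are mine for illustration, not necessarily what the real exporter does.

```python
import json


def to_prometheus(payload: str, prefix: str = "awair_") -> str:
    """Render the numeric fields of an air-data JSON document as Prometheus gauges."""
    data = json.loads(payload)
    lines = []
    for key, value in sorted(data.items()):
        # Skip non-numeric fields such as "timestamp". bool is checked first
        # because it is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"# TYPE {prefix}{key} gauge")
        lines.append(f"{prefix}{key} {value}")
    return "\n".join(lines) + "\n"


# A cut-down version of the response shown above.
sample = '{"co2": 799, "temp": 20.89, "score": 90, "timestamp": "2021-02-06T20:39:13.338Z"}'
print(to_prometheus(sample), end="")
```

A real exporter would fetch the JSON on each scrape and serve this text over HTTP; that part is left out here.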

And stick it on a dashboard; here’s my Saturday night gaming session:

It’s February now, so it’s cold. I had all the windows and my living room door shut from a little before 16:00. You can see the CO2 and VOC levels creeping up.

We had a 15-minute break in the middle. I opened the windows and door. You can see the levels drop back down. And then creep up again after the break.

The percentage in the top-left is an overall score based on the other metrics. 80%+ means your air is great; I’ve been aiming to keep it above 90%. The thresholds on the other graphs are based on the thresholds the Awair Element uses: it goes from 1 to 5 dots, which I’ve condensed into three sets of regions (ideal, good, bad).

I find myself glancing at the device (and the dashboard) throughout the day, and opening a window if it looks like I could do with a bit more ventilation.

Even if it’s making the numbers up² it’s making me get more fresh air, which can only be a good thing.

And yes, my thermostat is lying to me.

¹ Here’s a great video on why you should care about the quality of your air.
² Though I hope it’s not, and the movements on the graphs do correlate with when I have windows open.

Benchmarking WSGI servers

2020-12-23T00:00:00Z

I’ve been using Flask’s built-in WSGI server for bookdb and bookmarks for a while now. The very same built-in server that it warns you not to use in production because it scales badly.

But how badly? Fortunately, the Flask docs list some better servers, so I decided to try out a few of them.

Testing methodology

I decided to use siege, because it can take a list of URLs in a text file. I’ve got some prior experience of Gatling, but didn’t feel like writing Scala.

I produced a list of 30 bookdb URLs:

2 variations of the search page with no parameters (both HTML and JSON endpoints)
7 variations of the search page with parameters (all HTML)
1 book JSON endpoint
9 book cover images
9 book cover thumbnail images
2 static files (css and javascript)

And then I ran siege for 10s with 2, 4, and 8 workers, against:

The default Werkzeug WSGI server
Gunicorn, with 4 processes
uWSGI, with 4 processes
Gevent

Results

The results are in: the default Werkzeug server is bad at scaling! The number of transactions (completed requests) per second doesn’t really change, even when the number of siege workers (clients) goes up by a factor of 4. I suspect it’s processing requests synchronously in a single thread.

Every other server shows a good increase in throughput when the number of clients goes up. Though Gevent starts even slower than Werkzeug!

Gunicorn looks like a slight winner over uWSGI, so that’s the server I’ll be using going forwards.
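To put a number on "scales badly", here is a small sketch that divides each server's transaction rate at 8 clients by its rate at 2 clients, using the figures from the raw data appendix. The client count goes up by a factor of 4, so a server that scaled linearly would show a ratio close to 4.

```python
# Transaction rates (requests per second) copied from the raw data appendix,
# at 2, 4, and 8 concurrent siege clients.
rates = {
    "Werkzeug": [230.90, 265.07, 250.80],
    "Gunicorn": [310.98, 418.34, 567.64],
    "uWSGI": [291.58, 399.10, 540.48],
    "Gevent": [214.02, 280.16, 314.61],
}

# Ratio of throughput at 8 clients to throughput at 2 clients.
scaling = {server: r[-1] / r[0] for server, r in rates.items()}

for server, factor in scaling.items():
    print(f"{server}: {factor:.2f}x")
```

Werkzeug comes out at barely above 1x, which matches the flat bars in the graph.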

Appendix: raw data

Werkzeug

+ siege -q -t 10S -c 2 -f urls.txt

{       "transactions":                         2309,
        "availability":                       100.00,
        "elapsed_time":                        10.00,
        "data_transferred":                     8.21,
        "response_time":                        0.01,
        "transaction_rate":                   230.90,
        "throughput":                           0.82,
        "concurrency":                          1.98,
        "successful_transactions":              2309,
        "failed_transactions":                     0,
        "longest_transaction":                  0.61,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 4 -f urls.txt

{       "transactions":                         2648,
        "availability":                       100.00,
        "elapsed_time":                         9.99,
        "data_transferred":                     7.80,
        "response_time":                        0.01,
        "transaction_rate":                   265.07,
        "throughput":                           0.78,
        "concurrency":                          3.96,
        "successful_transactions":              2648,
        "failed_transactions":                     0,
        "longest_transaction":                  0.87,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 8 -f urls.txt

{       "transactions":                         2503,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                    11.85,
        "response_time":                        0.03,
        "transaction_rate":                   250.80,
        "throughput":                           1.19,
        "concurrency":                          7.96,
        "successful_transactions":              2503,
        "failed_transactions":                     0,
        "longest_transaction":                  0.89,
        "shortest_transaction":                 0.01
}

Gunicorn

+ siege -q -t 10S -c 2 -f urls.txt

{       "transactions":                         2833,
        "availability":                       100.00,
        "elapsed_time":                         9.11,
        "data_transferred":                     9.21,
        "response_time":                        0.01,
        "transaction_rate":                   310.98,
        "throughput":                           1.01,
        "concurrency":                          1.95,
        "successful_transactions":              2833,
        "failed_transactions":                     0,
        "longest_transaction":                  0.62,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 4 -f urls.txt

{       "transactions":                         4175,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                    16.54,
        "response_time":                        0.01,
        "transaction_rate":                   418.34,
        "throughput":                           1.66,
        "concurrency":                          3.94,
        "successful_transactions":              4175,
        "failed_transactions":                     0,
        "longest_transaction":                  1.24,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 8 -f urls.txt

{       "transactions":                         5665,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                    18.92,
        "response_time":                        0.01,
        "transaction_rate":                   567.64,
        "throughput":                           1.90,
        "concurrency":                          7.86,
        "successful_transactions":              5665,
        "failed_transactions":                     0,
        "longest_transaction":                  1.54,
        "shortest_transaction":                 0.00
}

uWSGI

+ siege -q -t 10S -c 2 -f urls.txt

{       "transactions":                         2875,
        "availability":                       100.00,
        "elapsed_time":                         9.86,
        "data_transferred":                     9.46,
        "response_time":                        0.01,
        "transaction_rate":                   291.58,
        "throughput":                           0.96,
        "concurrency":                          1.97,
        "successful_transactions":              2875,
        "failed_transactions":                     0,
        "longest_transaction":                  0.58,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 4 -f urls.txt

{       "transactions":                         3983,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                    16.38,
        "response_time":                        0.01,
        "transaction_rate":                   399.10,
        "throughput":                           1.64,
        "concurrency":                          3.94,
        "successful_transactions":              3983,
        "failed_transactions":                     0,
        "longest_transaction":                  1.03,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 8 -f urls.txt

{       "transactions":                         5394,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                    16.36,
        "response_time":                        0.01,
        "transaction_rate":                   540.48,
        "throughput":                           1.64,
        "concurrency":                          7.91,
        "successful_transactions":              5394,
        "failed_transactions":                     0,
        "longest_transaction":                  1.32,
        "shortest_transaction":                 0.00
}

Gevent

+ siege -q -t 10S -c 2 -f urls.txt

{       "transactions":                         2076,
        "availability":                       100.00,
        "elapsed_time":                         9.70,
        "data_transferred":                     7.91,
        "response_time":                        0.01,
        "transaction_rate":                   214.02,
        "throughput":                           0.82,
        "concurrency":                          1.97,
        "successful_transactions":              2076,
        "failed_transactions":                     0,
        "longest_transaction":                  0.65,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 4 -f urls.txt

{       "transactions":                         2796,
        "availability":                       100.00,
        "elapsed_time":                         9.98,
        "data_transferred":                     8.97,
        "response_time":                        0.01,
        "transaction_rate":                   280.16,
        "throughput":                           0.90,
        "concurrency":                          3.96,
        "successful_transactions":              2796,
        "failed_transactions":                     0,
        "longest_transaction":                  0.63,
        "shortest_transaction":                 0.00
}
+ siege -q -t 10S -c 8 -f urls.txt

{       "transactions":                         3143,
        "availability":                       100.00,
        "elapsed_time":                         9.99,
        "data_transferred":                    12.68,
        "response_time":                        0.03,
        "transaction_rate":                   314.61,
        "throughput":                           1.27,
        "concurrency":                          7.95,
        "successful_transactions":              3143,
        "failed_transactions":                     0,
        "longest_transaction":                  0.59,
        "shortest_transaction":                 0.01
}

Appendix: urls.txt

http://127.0.0.1:3000/search
http://127.0.0.1:3000/search?keywords=flatland&author%5B%5D=&location=&match=&category=
http://127.0.0.1:3000/search?keywords=flatland&author%5B%5D=Ian+Stewart&location=&match=&category=
http://127.0.0.1:3000/search?keywords=&author%5B%5D=&location=f256ed66-4c09-4207-86de-adc8e9fb86ec&match=&category=
http://127.0.0.1:3000/search?keywords=&author%5B%5D=&location=f256ed66-4c09-4207-86de-adc8e9fb86ec&match=only-unread&category=
http://127.0.0.1:3000/search?keywords=Before+Dawn&author%5B%5D=&location=f256ed66-4c09-4207-86de-adc8e9fb86ec&match=only-unread&category=
http://127.0.0.1:3000/search?keywords=Before+AND+Dawn&author%5B%5D=&location=f256ed66-4c09-4207-86de-adc8e9fb86ec&match=only-unread&category=
http://127.0.0.1:3000/search?keywords=&author%5B%5D=Zzarchov+Kowolski&location=&match=only-read&category=70196ec9-dd61-4241-afc9-dd6be7be30a6
http://127.0.0.1:3000/search.json
http://127.0.0.1:3000/book/9780486272634
http://127.0.0.1:3000/book/9780486272634/cover
http://127.0.0.1:3000/book/9780262510875/cover
http://127.0.0.1:3000/book/9780575082014/cover
http://127.0.0.1:3000/book/9780575079793/cover
http://127.0.0.1:3000/book/9780141397726/cover
http://127.0.0.1:3000/book/9780575086159/cover
http://127.0.0.1:3000/book/9780199535644/cover
http://127.0.0.1:3000/book/9780575077324/cover
http://127.0.0.1:3000/book/9781421578798/cover
http://127.0.0.1:3000/book/9780486272634/thumb
http://127.0.0.1:3000/book/9780262510875/thumb
http://127.0.0.1:3000/book/9780575082014/thumb
http://127.0.0.1:3000/book/9780575079793/thumb
http://127.0.0.1:3000/book/9780141397726/thumb
http://127.0.0.1:3000/book/9780575086159/thumb
http://127.0.0.1:3000/book/9780199535644/thumb
http://127.0.0.1:3000/book/9780575077324/thumb
http://127.0.0.1:3000/book/9781421578798/thumb
http://127.0.0.1:3000/static/style.css
http://127.0.0.1:3000/static/script.js

Appendix: graph script

#! /usr/bin/env nix-shell
#! nix-shell -i python -p "python3.withPackages (ps: [ps.matplotlib ps.numpy])"

import matplotlib.pyplot as plt
import numpy as np

plt.xkcd()
plt.figure(figsize=(12,6))

labels = ["Werkzeug", "Gunicorn", "uWSGI", "Gevent"]
bars = [("2 workers", [230.90, 310.98, 291.58, 214.02]),
        ("4 workers", [265.07, 418.34, 399.10, 280.16]),
        ("8 workers", [250.80, 567.64, 540.48, 314.61])]

bar_width = 0.25

rs = [np.arange(len(labels))]
for i in range(len(bars)-1):
    rs.append([x + bar_width for x in rs[-1]])

for i in range(len(bars)):
    plt.bar(rs[i], bars[i][1], width=bar_width, label=bars[i][0])

plt.ylabel("Transactions per second (higher is better)")
plt.xlabel("Server")
plt.xticks([r + bar_width for r in range(len(labels))], labels)

plt.legend()
plt.savefig("transaction-rate.png")

Migrate GOV.UK to Puma

2020-12-23T00:00:00Z

Mere hours after going on leave for the festive period, I’ve got back in the mood to do complicated tech things for fun, and the topic which came to mind is “how hard would it be to migrate GOV.UK from Unicorn (old and busted) to Puma (new hotness)?”

Not only does Puma have a much prettier website, it’s also the Rails default web server (and has been for a while). So the wider ecosystem has decided it’s a better server. Furthermore, Puma potentially solves an awkward problem we have with Unicorn: memory usage.

Unicorn runs multiple worker processes, which can each take up quite a bit of RAM. It adds up quickly if you have multiple apps running on the same server. If a process is IO bound rather than CPU bound, this makes scaling more awkward: we either have to bring up new servers or embiggen our current ones.

Puma, on the other hand, runs multiple threads within each worker process. Threads can be very lightweight, sharing almost all of their memory. So we can pack far more threads on the same server, so long as our application is not CPU bound.

I’ve done some thinking on how we could try out Puma on GOV.UK. These steps are untested, and are based on reading old puppet code and init scripts at 2AM, so follow them at your peril. But I think it would be something like this.

Configure the app

The app needs a Puma config file. Eventually we would want something shared in govuk_app_config, but if we’re trying this out with a single app to start with, a config file in the app would do.

I think we’ll want something like this:

# frozen_string_literal: true

max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 1 }
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count

port ENV.fetch("PORT") { 3000 }

environment ENV.fetch("RAILS_ENV") { "development" }

workers ENV.fetch("UNICORN_WORKER_PROCESSES") { 2 }

preload_app!

Puma concurrency is threads * workers. We can run Puma in the same way as we run Unicorn (configure the number of workers, but give each only 1 thread), which will let us see the performance impact of Puma by itself. We can also set workers lower and threads higher to start to get the memory savings. There’s probably some tweaking to be done.
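As a quick illustration of that trade-off, this sketch lists every workers/threads split that gives the same total concurrency. The `layouts` helper and the target of 8 are made up for the example; they are not GOV.UK settings.

```python
def layouts(concurrency: int) -> list[tuple[int, int]]:
    """Return every (workers, threads) pair whose product is `concurrency`."""
    return [
        (workers, concurrency // workers)
        for workers in range(1, concurrency + 1)
        if concurrency % workers == 0
    ]


for workers, threads in layouts(8):
    print(f"workers={workers} threads={threads}")
```

The last pair, 8 workers with 1 thread each, is the Unicorn-like layout; moving up the list trades processes for threads while handling the same number of requests at once.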

Add Puma support to unicornherder

Our unicornherder tool is a common abstraction over Unicorn and Gunicorn. We can add Puma support to it too:

COMMANDS = {
    'unicorn': 'unicorn -D -P "{pidfile}" {args}',
    'unicorn_rails': 'unicorn_rails -D {args}',
    'unicorn_bin': '{unicorn_bin} -D -P "{pidfile}" {args}',
    'gunicorn': 'gunicorn -D -p "{pidfile}" {args}',
    'gunicorn_django': 'gunicorn_django -D -p "{pidfile}" {args}',
    'gunicorn_bin': '{gunicorn_bin} -D -p "{pidfile}" {args}',
    'puma': 'pumactl start -P "{pidfile}" {args}'
}
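To see what the new entry would expand to, here is a sketch that assumes the placeholders are filled in with `str.format`-style substitution, which is what the `{pidfile}` and `{args}` syntax suggests. The pidfile path is invented and deliberately contains a space, to show why the template quotes it.

```python
import shlex

# The proposed template from the table above.
PUMA_TEMPLATE = 'pumactl start -P "{pidfile}" {args}'

# Hypothetical values; the real ones come from how unicornherder is invoked.
command = PUMA_TEMPLATE.format(pidfile="/var/run/my app/app.pid", args="")

# Split the way a POSIX shell would, to check the pidfile stays one argument.
argv = shlex.split(command)
print(argv)
```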

There’s also some logic around restarts, waiting for the old master process to terminate its workers gracefully and then kill it. I don’t think that will do anything useful under Puma, but I don’t think it’ll cause any problems either.

Note: unicornherder sends a SIGUSR2 which, for Puma, will perform something like what we call a “deploy with hard restart”, where the old processes get killed and new ones brought up. However, the puma docs describe how things are handled gracefully:

Any in-flight requests get handled before the server is shut down.
Any requests which start just as the server restarts will experience some latency, but will not be dropped.

Since this is a full process restart, any new Ruby version will be used, and any change to the Puma config will be applied. This means we will no longer need to do a separate hard restart for Puma apps when upgrading Ruby!

Puma also offers a phased restart approach, which restarts one worker at a time, but that doesn’t reload the Puma master process, and so won’t pick up a new Ruby version or new Puma config. It’s also incompatible with the preload_app! option.

Add Puma support to govuk_spinup

The confusing initialisation of a GOV.UK app begins in a sysvinit script, …

Which calls govuk_spinup, …

Which calls start-stop-daemon, …

Which calls unicornherder, …

Which finally calls the app server.

I think the changes needed here are to govuk_spinup. We’ll need a new app type, let’s call it “puma”:

  puma)
    status "Spawning rack app under puma"

    if [ ! -e "${GOVUK_APP_ROOT}/config/puma.rb" ]; then
      error "Missing Puma config file"
    fi

    CMD="bundle exec unicornherder -u puma -p '${GOVUK_APP_RUN}/app.pid' -- -F '${GOVUK_APP_ROOT}/config/puma.rb'"
    ;;

There’s also a govuk_unicorn_reload script, called during deploys, but I don’t think that needs to change.

Set up monitoring for Puma apps

The govuk::app::config class in govuk-puppet defines a bunch of Icinga alerts which’ll need changing, or copying, for our new “puma” app type to be as monitored as it should be.

This:

  # Set up monitoring
  if $app_type in ['rack', 'bare', 'procfile'] {
    $default_collectd_process_regex = $app_type ? {
      'rack' => "unicorn (master|worker\\[[0-9]+\\]).* -P ${govuk_app_run}/app\\.pid",
      'bare' => inline_template('<%= Regexp.escape(@command) + "$" -%>'),
      'procfile' => "gunicorn .* ${govuk_app_run}/app\\.pid",
    }

And this:

  if ($app_type == 'rack') or $monitor_unicornherder {
    @@icinga::check { "check_app_${title}_unicornherder_up_${::hostname}":
      ensure              => $ensure,
      check_command       => "check_nrpe!check_proc_running_with_arg!unicornherder /var/run/${title}/app.pid",
      service_description => "${title} app unicornherder not running",
      host_name           => $::fqdn,
      notes_url           => monitoring_docs_url(unicorn-herder),
      contact_groups      => $additional_check_contact_groups,
    }
  }

And this:

  if $app_type == 'rack' {
    include icinga::client::check_unicorn_ruby_version
    @@icinga::check { "check_app_${title}_unicorn_ruby_version_${::hostname}":
      ensure              => $ensure,
      check_command       => "check_nrpe!check_unicorn_ruby_version!${title}",
      service_description => "${title} is not running the expected ruby version",
      host_name           => $::fqdn,
      notes_url           => monitoring_docs_url(ruby-version),
      contact_groups      => $additional_check_contact_groups,
    }
  }

Change the app to a Puma app

Now that we’ve got our new app type, we need to stick app_type => 'puma' in the relevant call to govuk::app elsewhere in govuk-puppet.

And that’s it!

Finally, deploy the change

Since we’re using proper init scripts with pidfile management, I think that deploying Puppet will be a graceful change:

Puppet will trigger a restart of the app due to the change to its config and govuk_spinup.
The init script will read the existing pidfile and stop the old Unicorn process in the usual SIGINT / SIGKILL way.
The init script will start the app up with Puma via the modified govuk_spinup / unicornherder.

If not, I think the steps to deploy will be:

Pause Puppet on the affected (perhaps afflicted?) machines
Deploy Puppet
For each machine:
1. Manually stop the app
2. Unpause and run Puppet

Automatically tagging audio files (using systemd and inotify)

2020-10-14T00:00:00Z

I follow several podcasts and, if you do the same, you may have noticed that podcast creators are terrible at consistently tagging their files. For example, is the artist the name of the podcast, the names of the presenters, just one of the presenters, some abbreviation, or the names of the presenters in a different order? Probably all of those, and more, get used inconsistently across the lifetime of a multi-year-old podcast.

Inconsistent tagging makes it a pain to use tools which use that information, which is pretty much every audio player.

For some years, my solution was a script which retagged all my podcasts. This can be done because I use a standard directory and file naming convention. But the downside is that it retagged every file of every podcast when run, even though I’d only be adding one new file at a time.

systemd path units

I recently discovered systemd path units, which seemed like the solution to this problem: I could have a script which was triggered by a file being created, tagged it, and moved it to the right place. Path units turned out not to be the solution to this problem, but they were a solution to a slightly different one.

My first attempt was to add a subdirectory to every podcast directory called in, and to write this path unit:

[Unit]
Description=Automatically tag new podcast files
RequiresMountsFor=/mnt/nas

[Path]
PathExistsGlob=/mnt/nas/music/Podcasts/*/in/*.mp3

And this service file:

[Unit]

[Service]
Environment="PATH=<...>"
ExecStart=/usr/local/bin/tag-podcasts.sh
Group=users
User=barrucadu
WorkingDirectory=/mnt/nas/music/Podcasts/

And this bash script:

#!/usr/bin/env bash

for mp3file in */in/*.mp3; do
  dir="$(echo "$mp3file" | sed 's:/in/.*::')"
  f="$(basename "$mp3file")"

  artist="$(echo "$dir" | sed 's: - .*::')"
  album="$(echo "$dir" | sed 's:.* - ::')"

  if [[ -z "$album" ]]; then
    album="$artist"
  fi

  n="$(echo "$f" | sed 's:\..*::')"
  track="$(echo "$f" | sed 's:^[0-9]*\. \(.*\)\.mp3:\1:')"

  echo "===== $mp3file" >&2
  echo "$artist" >&2
  echo "$album" >&2
  echo "$n" >&2
  echo "$track" >&2
  echo "$(echo "$mp3file" | sed 's:/in/:/:')" >&2
  echo >&2

  id3v2 -D "$mp3file"
  id3v2 -2 --song   "$track"  "$mp3file"
  id3v2 -2 --track  "$n"      "$mp3file"
  id3v2 -2 --artist "$artist" "$mp3file"
  id3v2 -2 --album  "$album"  "$mp3file"
  mv "$mp3file" "$(echo "$mp3file" | sed 's:/in/:/:')"
done

This turned out not to work. The unit just didn’t pick up any file changes. Any one podcast would work¹, but I didn’t really want to have to make a unit for each of my podcasts… I’d need to update my system configuration if I started following a new podcast; that feels like too much mixing of global configuration and how I (admittedly the single user of the system) use it.

So I had to give up on path units for tagging my podcasts.

Tagging podcasts

I was too invested at this point to give up entirely: I wanted automatic tagging.

So I turned to inotifywait, and stuck this at the end of my script:

# this can't be done as a systemd path unit because it doesn't seem to
# support multiple *s in a pattern
inotifywait --recursive --timeout 3600 --include '/mnt/nas/music/Podcasts/.*/in/.*\.mp3' $(pwd) >&2

# this script is run in a loop by systemd.
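The `--include` pattern is a regular expression rather than a glob, which is why it can wildcard both the podcast directory and the file name. Here is a quick sketch checking it against a few made-up paths. Python's `re` is a different engine from the POSIX one inotifywait uses, but for a pattern this simple I'd expect them to agree.

```python
import re

# The same pattern passed to `inotifywait --include` above.
PATTERN = re.compile(r"/mnt/nas/music/Podcasts/.*/in/.*\.mp3")

# Hypothetical paths, for illustration only.
paths = [
    "/mnt/nas/music/Podcasts/How We Roll - Masks of Nyarlathotep/in/01. Episode.mp3",
    "/mnt/nas/music/Podcasts/How We Roll - Masks of Nyarlathotep/01. Episode.mp3",
    "/mnt/nas/music/Podcasts/How We Roll - Masks of Nyarlathotep/in/notes.txt",
]

# True only for .mp3 files inside some podcast's "in" directory.
matches = [bool(PATTERN.search(path)) for path in paths]
for path, matched in zip(paths, matches):
    print(matched, path)
```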

The next step was to make a systemd unit which just runs that script in a loop. Which is defined in my NixOS config as:

systemd.services.tag-podcasts = {
  enable = true;
  description = "Automatically tag new podcast files";
  wantedBy = ["multi-user.target"];
  path = with pkgs; [ inotifyTools id3v2 ];
  unitConfig.RequiresMountsFor = "/mnt/nas";
  serviceConfig = {
    WorkingDirectory = "/mnt/nas/music/Podcasts/";
    ExecStart = pkgs.writeShellScript "tag-podcasts.sh" (fileContents ./tag-podcasts.sh);
    User = "barrucadu";
    Group = "users";
    Restart = "always";
  };
};

And now I’ve got a script which, once an hour (or on detecting a file change, whichever is sooner) tags all new podcast files and moves them to the correct directories. No more SSHing in and running my tagging script, I can just save a file as, eg, How We Roll - Masks of Nyarlathotep/in/{number}. {title}.mp3, over Samba or NFS, and within a few seconds it gets picked up, tagged, and organised. Nice.
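The naming convention is what makes this possible: every tag can be derived from the path alone. Here is a sketch of the same parsing the `sed` calls in the script do, written in Python. The function name and the episode title are made up for the example.

```python
import re


def parse_podcast_path(path: str) -> dict[str, str]:
    """Derive tags from '{artist} - {album}/in/{number}. {title}.mp3'."""
    directory, _, filename = path.partition("/in/")
    # Artist is everything before the first " - ", album everything after the last.
    artist = directory.split(" - ", 1)[0]
    album = directory.rsplit(" - ", 1)[-1] or artist
    # Track number is everything before the first dot in the file name.
    number = filename.split(".", 1)[0]
    match = re.match(r"[0-9]*\. (.*)\.mp3$", filename)
    title = match.group(1) if match else filename
    return {
        "artist": artist,
        "album": album,
        "track": number,
        "title": title,
        # The tagged file moves up one level, out of the "in" directory.
        "destination": path.replace("/in/", "/", 1),
    }


tags = parse_podcast_path("How We Roll - Masks of Nyarlathotep/in/12. Some Episode.mp3")
for key, value in tags.items():
    print(f"{key}: {value}")
```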

Tagging albums

I couldn’t use a single path unit to trigger my script for tagging podcasts, but that’s not the only time I want to tag some audio files. I have a collection of CDs, which I very infrequently add to, and I have those CDs ripped and stored as FLAC files. Appropriately tagged, of course.

I use Exact Audio Copy (EAC) to rip my CDs to WAV, which uses a predictable directory layout and file naming convention. I already had a script to take an EAC directory and produce a tagged and organised FLAC directory, I just needed to make it automatic.

First, here’s my systemd configuration:

systemd.paths.flac-and-tag-album = {
  enable = true;
  description = "Automatically flac and tag new albums";
  wantedBy = ["multi-user.target"];
  unitConfig.RequiresMountsFor = "/mnt/nas";
  pathConfig.PathExistsGlob = "/mnt/nas/music/to_convert/in/*";
};
systemd.services.flac-and-tag-album = {
  path = with pkgs; [ flac ];
  serviceConfig = {
    WorkingDirectory = "/mnt/nas/music/to_convert/in/";
    ExecStart = pkgs.writeShellScript "flac-and-tag-album.sh" (fileContents ./flac-and-tag-album.sh);
    User = "barrucadu";
    Group = "users";
  };
};

An album consists of multiple files, but I don’t want to try to convert an album where some of the files are still part-way through being copied to the NAS; that sounds like an easy way to end up with incomplete FLACs. So I came up with this workflow:

A CD is ripped to WAV with EAC on my desktop
The EAC directory is copied over to /mnt/nas/music/to_convert over Samba (which will take a few seconds)
Then the directory is moved to /mnt/nas/music/to_convert/in (which will be instantaneous)
The path unit notices the new subdirectory, and triggers the script.
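The copy-then-move trick in that workflow can be sketched as below. It uses throwaway temporary directories in place of my desktop and the NAS, and the `deliver` helper and file names are invented for the example. It relies on a rename within one filesystem being a single step, so anything watching the `in` directory never sees a half-copied album.

```python
import os
import shutil
import tempfile
from pathlib import Path


def deliver(album_dir: Path, staging: Path, inbox: Path) -> Path:
    """Copy a directory into `staging`, then rename it into `inbox`."""
    staged = staging / album_dir.name
    shutil.copytree(album_dir, staged)  # slow step: files appear one at a time
    final = inbox / album_dir.name
    os.rename(staged, final)  # fast step: a single rename on the same filesystem
    return final


# Throwaway directories stand in for the desktop and the NAS share.
with tempfile.TemporaryDirectory() as root:
    rip = Path(root) / "desktop" / "Some Artist"
    (rip / "Some Album").mkdir(parents=True)
    (rip / "Some Album" / "01. Some Track.wav").write_bytes(b"RIFF")

    staging = Path(root) / "to_convert"
    inbox = staging / "in"
    inbox.mkdir(parents=True)

    final = deliver(rip, staging, inbox)
    delivered = sorted(p.name for p in (final / "Some Album").iterdir())
    left_in_staging = (staging / "Some Artist").exists()

print(delivered, left_in_staging)
```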

And here’s the script:

#!/usr/bin/env bash

set -e

# Directory names are quoted throughout so that artists and albums with
# spaces in their names don't get split into several arguments.
for artist in *; do
  if [[ -d "$artist" ]]; then
    pushd "$artist"
    for album in *; do
      if [[ -d "$album" ]]; then
        echo "===== $artist - $album" >&2
        pushd "$album"
        if [[ ! -e "$artist - $album.log" ]]; then
          echo "(missing log file)" >&2
        fi
        if [[ ! -e "cover.jpg" ]] && [[ ! -e "cover.png" ]] && [[ ! -e "cover.gif" ]]; then
          echo "(missing cover file)" >&2
        fi
        flac *.wav
        rm *.wav
        for flacfile in *.flac; do
          n="$(echo "$flacfile" | sed 's:\..*::')"
          track="$(echo "$flacfile" | sed 's:^[0-9]*\. \(.*\)\.flac:\1:')"
          metaflac --set-tag="tracknumber=$n" "$flacfile"
          metaflac --set-tag="title=$track"   "$flacfile"
          metaflac --set-tag="artist=$artist" "$flacfile"
          metaflac --set-tag="album=$album"   "$flacfile"
        done
        popd
        echo
        mv "$album" "../../out/$artist - $album"
      fi
    done
    popd
    rmdir "$artist"
  fi
done

Nice and straightforward. I’ve not ripped any new CDs since setting this up earlier this week, but I converted some FLACs back to WAVs, shuffled the directory layout around, and tested that they got picked up and re-converted properly.

¹ For example, if I changed the PathExistsGlob to something like /mnt/nas/music/Podcasts/The H. P. Lovecraft Literary Podcast/in/*.mp3

AI Dungeon Fiction: The Red Sign

2020-07-26T00:00:00Z

This story was generated by AI Dungeon. Bold text was written by me, the rest by the AI.

Your good friend Jackson Elias calls for your help with a telegram scarce on details. Something about the ill-fated Carlyle Expedition of 1920, where everyone ended up dead. He said he’s found something. You arrive at his hotel room and knock on the door, but get no response. The door is ajar. You push it open and see a man in a dark suit standing there.

“I’m here to help you,” says the stranger. “You’re not going to believe what I’ve found.”

He leads you into an empty office and closes the door behind him.

You say “Where’s Jackson Elias?”

“Jackson? What happened to him?”

The man turns around and looks directly at you. His face is pale, his eyes are bloodshot and his hair is disheveled. “I don’t know anything about that,”

You say “This is his room! He should be here!”

“The less you know about him the better. But I need your help.”

He takes you into the next room, where you see an ancient stone pedestal with an irregularly-shaped gem placed on it. A strange pattern of symbols surrounds it.

You examine the pedestal.

You approach the pedestal and examine the gem. It’s cut in an irregular fashion, with a dull gray color. You notice strange red markings etched into the surface of the gem.

“The Red Sign,” he says. “The key to uniting the worlds.”

“What?” you ask. “What does that mean?”

“You’re not a scholar, are you?”

“No. I’m a traveler.”

“Just as well. They’d skin you alive if they knew what you were saying. These are old secrets, forgotten by most.”

“I’m not afraid,” you say, puffing out your chest.

“Maybe you should be,” he says.

You say “What’s all this about?”

“The old ones. The Great Old Ones. The Lords of the Outer Rims. The Elder Gods. Gods from before the gods, some say. They dwell in a realm of twisted madness that lies on the far edges of reality, just outside the material world.”

You say “Look man, I just came here because my friend Jackson Elias needed my help. If you won’t tell me where he is, I’ll call the police.”

He stares at you, his eyes filled with a mixture of fear and desperation.

“Please, don’t do that,” he says. “They’re everywhere. And if they know I’ve betrayed them, they’ll take it out on my family.”

You look him up and down.

You say “Who are you?”

“Pardon me. I am Julian Cross. Jackson is my… brother in all but blood. We grew up together.”

He hands you a calling card, which you examine. It simply reads “Julian Cross, Antiquarian.”

You say “Jackson never mentioned you”

“No. We’ve had… a bit of a falling out. It’s a long story.”

“What happened?”

“I’ll tell you once we’re out of here.”

He approaches the pedestal.

You say “This… Red Sign has something to do with Jackson?”

“This is what they used. A powerful artifact, steeped in ancient and forgotten lore. It was to be the key to opening a gate to their realm.”

You stare at the Red Sign. You don’t even know what to say. This is madness.

You say “Look, I don’t believe in any of this magic stuff. If it’ll make you feel better, I can draw an Elder Sign on that Red Sign. Jackson always said an Elder Sign would cancel out magic. Then you’ll tell me where he is, right?”

He hesitates.

“Go ahead,” he says.

You pull out a pen and paper and begin drawing the Elder Sign. You step towards the pedestal and, on second thought, switch to a pencil. No need to take chances.

You step up to the pedestal and draw the Elder Sign on the Red Sign.

“There,” you say, stepping back. Suddenly, the lights begin to flicker.

“They know,” he says. “I’ve betrayed them. I have to go.”

The lights flicker again, then go out completely. You feel the door slam shut, then the sound of several bolts sliding into place.

“Julian?” you call out. Then you hear his terrified scream. Then, another voice… many voices… crying out in an incomprehensible language.

Then, a low rumbling. The sound of thousands of crawling insects… the sound of a hurricane… the sound of a crash of breaking glass…

Then, silence.

You continue to call out for Julian, but receive no answer.

The next time you wake up, you find yourself lying on the floor of the office. The desk has been overturned, and there’s a large hole in the window. A quick check reveals that it’s the next morning, and you’ve lost a full day.