Apr 11, 2021

#134

Work

Another short week, with the bank holiday Monday.

This week I improved some documentation, set up continuous deployment for another of our apps, tidied up a bit of old code, and opened another RFC; this time on improving our app healthchecks.

Books

I didn’t finish any books this week. I’ve been reading the Google SRE books and re-reading Helliconia.

Next Programming Project?

I’ve been trying to think of a big meaty project I can start. Something to get stuck into, which will stretch what I can comfortably do. And I think I might have found the right idea.

My own mini-kubernetes. Or at least, the scheduling and monitoring part of it. Here’s the somewhat-fuzzy thinking so far:

Job & Pods

The right level of scheduling is a “job”. A job defines:

- A name
- How many instances to run
- The containers that make up each instance

And a container defines:

- The usual container configuration: image, command, environment, ports, dependencies, restart policy
- A memory limit
- A healthcheck
- Any volumes it needs

A full definition might look like this:

name: "simple-webapp"
numInstances: 3
containers:
  app:
    # normal container configuration, similar to docker-compose.
    image: web-app-image
    restart: always
    command: ["bin/rails", "s", "--restart"]
    environment:
      - name: DATABASE_URL
        value: postgresql://webapp:webapp@localhost/webapp
    ports:
      - "80:3000"
    depends_on:
      - db

    # have a hard memory limit of 2G, after which allocations will fail
    memory: 2G

    # call /healthcheck every 60s, timing out after 30s, and require two
    # consecutive 200 status codes to switch to "healthy" state, and two
    # consecutive non-200 status codes to switch to "unhealthy" state.
    healthcheck:
      type: http
      port: 3000
      path: /healthcheck
      timeout: 30
      interval: 60
      healthy_threshold: 2
      unhealthy_threshold: 2

    # mount a 1G shared volume at /data/shared, all instances get the same
    # shared volume and see each other's writes; mount a 100M non-shared
    # volume at /data/scratch, which will be lost if the instance or host
    # machine is terminated.
    volumes:
      - type: shared
        path: /data/shared
        space: 1G
      - type: local
        path: /data/scratch
        space: 100M

  db:
    image: postgres:13
    restart: always
    memory: 4G
    environment:
      - name: POSTGRES_USER
        value: webapp
      - name: POSTGRES_PASSWORD
        value: webapp
      - name: POSTGRES_DB
        value: webapp
    volumes:
      - type: local
        path: /var/lib/postgresql/data
        space: 10G

This is more or less a docker-compose file; ideally the syntax will stay very close to what people already know.

The Scheduler

Each member of the cluster runs a scheduler process: a Haskell application I would write. I’m thinking of calling it Ozymandias.

The schedulers store all their state in, and communicate through, etcd: a distributed, strongly consistent, key-value store.
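
The exact etcd schema is one of the fuzzy parts, but as a purely hypothetical sketch, the keyspace might look something like this, with job definitions, the current schedule, and per-node state all living under a common prefix:

-- All of these paths are made up; they just illustrate that everything the
-- schedulers share is a plain etcd key.
jobKey :: String -> String
jobKey name = "/ozymandias/jobs/" ++ name

-- the whole current schedule as a single value, so it can be swapped atomically
scheduleKey :: String
scheduleKey = "/ozymandias/schedule"

-- per-node state, written under that node's lease
nodeKey :: String -> String
nodeKey node = "/ozymandias/nodes/" ++ node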

When a user submits a new job

The scheduler checks whether the job is feasible by trying to bin-pack the desired number of instances onto the nodes in the cluster, based on their memory and disk space constraints.

If there’s not enough free memory or disk space in the cluster, the scheduler rejects the job by returning a problem details object:

{
  "type": "https://github.com/barrucadu/ozymandias/main/docs/errors.markdown#insufficient-memory",
  "title": "Insufficient memory for job",
  "detail": "Job requires 12G of memory but only 10G can be allocated.",
  "job": "my-cool-job"
}

If the scheduler did manage to compute an allocation of instances to nodes, it atomically commits the new job and the schedule to etcd, and returns an OK response to the user.
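
The feasibility check itself could be something as simple as first-fit-decreasing bin packing. A minimal sketch, with all the types and field names made up for illustration (the real inputs would come from the state in etcd):

import Data.List (sortOn)
import Data.Ord (Down(..))

-- A node's free resources, and an instance's requirements.
data Node     = Node     { nodeName :: String, freeMemory :: Int, freeDisk :: Int }
data Instance = Instance { instJob  :: String, memNeeded  :: Int, diskNeeded :: Int }

-- Try to pack every instance onto some node, biggest instances first.
-- Returns the allocation, or Nothing if the job doesn't fit.
allocate :: [Node] -> [Instance] -> Maybe [(String, Instance)]
allocate nodes = go nodes . sortOn (Down . memNeeded)
  where
    go _  []     = Just []
    go ns (i:is) = case place ns i of
      Nothing          -> Nothing
      Just (name, ns') -> ((name, i) :) <$> go ns' is

    -- first node with room for the instance, plus the updated node list
    place [] _ = Nothing
    place (n:ns) i
      | freeMemory n >= memNeeded i && freeDisk n >= diskNeeded i =
          Just ( nodeName n
               , n { freeMemory = freeMemory n - memNeeded i
                   , freeDisk   = freeDisk n - diskNeeded i
                   } : ns
               )
      | otherwise = fmap (fmap (n :)) (place ns i)

Bin packing is NP-hard in general, but a greedy heuristic like this is probably fine at the scale I care about.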

When a user scales a job up

Much the same as when a job is created, the scheduler tries to compute an allocation for the new instances, either committing the updated job and new schedule, or returning an error.

When a user scales a job down

The scheduler picks instances to terminate, and commits a new schedule with those instances gone.

When a scheduler notices it has been allocated more instances

It starts them up, using podman as the container runtime. Pods are started on a flannel-managed network, meaning that any pod can talk to any other pod in the cluster, regardless of which host it’s running on.
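
For example, starting one instance of the app container from the job above might look like this. Only a sketch: the network name is an assumption, and a real scheduler would probably talk to podman’s API socket rather than shelling out to the CLI:

import System.Process (callProcess)

-- Sketch only: assumes a flannel-backed podman network named "ozymandias"
-- already exists on the host.
startAppInstance :: IO ()
startAppInstance = callProcess "podman"
  [ "run", "--detach"
  , "--name", "simple-webapp-app-0"
  , "--network", "ozymandias"
  , "--memory", "2g"
  , "--restart", "always"
  , "--env", "DATABASE_URL=postgresql://webapp:webapp@localhost/webapp"
  , "--publish", "80:3000"
  , "web-app-image"
  , "bin/rails", "s", "--restart"
  ]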

When a scheduler notices it has been allocated fewer instances

It terminates any unnecessary instances and cleans up any local storage they’ve been using.

When a scheduler notices an instance is healthy

A record is added to the coredns state in etcd, so that other pods can find it by name.
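
coredns can serve records straight out of etcd (via its etcd plugin, which uses SkyDNS-style keys), so “adding a record” might be no more than a put. The domain and key layout here are assumptions:

import System.Process (callProcess)

-- Sketch only: the ozymandias.local domain and the key layout are made up.
registerInstance :: String -> String -> String -> IO ()
registerInstance job inst ip = callProcess "etcdctl"
  [ "put"
  , "/skydns/local/ozymandias/" ++ job ++ "/" ++ inst
  , "{\"host\": \"" ++ ip ++ "\"}"
  ]

-- e.g. registerInstance "simple-webapp" "0" "10.88.0.4" makes
-- simple-webapp.ozymandias.local resolve to 10.88.0.4.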

When a scheduler notices an instance is unhealthy

Its DNS record is removed from the coredns state, and it gets restarted in line with its restart policy.
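
The healthy/unhealthy decision itself is just the two-threshold state machine described in the example job: count consecutive results that disagree with the current state, and flip once they reach the threshold. Something like this (names made up):

data Health = Healthy | Unhealthy deriving (Eq, Show)

-- The current state, plus how many consecutive results have disagreed with it.
data HealthState = HealthState { current :: Health, streak :: Int }

-- Feed in one healthcheck result (True = got a 200).
observe :: Int -> Int -> HealthState -> Bool -> HealthState
observe healthyThreshold unhealthyThreshold st ok
  | agrees           = st { streak = 0 }
  | streak' >= limit = HealthState (flipH (current st)) 0
  | otherwise        = st { streak = streak' }
  where
    agrees  = ok == (current st == Healthy)
    streak' = streak st + 1
    limit   = if current st == Healthy then unhealthyThreshold else healthyThreshold
    flipH Healthy   = Unhealthy
    flipH Unhealthy = Healthy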

Fault Tolerance

A scheduler stores all of its state using an etcd lease, so that if a scheduler crashes and doesn’t come back online, its state expires and the rest of the cluster knows that any instances running on it need to be re-allocated.

When the scheduler starts up, it checks whether there is already a lease for that node (this will be the case if the scheduler crashed and restarted before the lease expired); if there is, it re-uses that lease rather than creating a fresh one.

If one scheduler notices that another has gone offline, it computes an allocation of the lost job instances to the remaining nodes and atomically commits that. The new instances then start up as normal. If there aren’t enough cluster resources to bring up all the instances, as many as possible are brought up.

As the new schedule is committed atomically, and only if the current schedule hasn’t been updated, there is no need for a master scheduler to handle fault tolerance. Each scheduler can race to reallocate the missing instances, and the first to compute a schedule will win.
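
In code, that race might look something like this. EtcdClient, getWithRevision, and putIfUnchanged are hypothetical stand-ins for whatever the real etcd bindings end up being (an etcd v3 transaction that compares the schedule key’s mod revision before writing):

-- All of these names are hypothetical stand-ins for a real etcd client.
data EtcdClient  -- connection handle
data Schedule    -- the allocation of instances to nodes

getWithRevision :: EtcdClient -> String -> IO (Schedule, Integer)
getWithRevision = undefined  -- read a key and its mod revision

putIfUnchanged :: EtcdClient -> String -> Integer -> Schedule -> IO Bool
putIfUnchanged = undefined   -- transactional write; False if someone else won the race

-- Read the schedule, replan, and commit only if nobody else committed first.
-- Losing the race is fine: the winner's schedule is just as good.
reallocate :: EtcdClient -> (Schedule -> Maybe Schedule) -> IO Bool
reallocate etcd replan = do
  (sched, revision) <- getWithRevision etcd "/ozymandias/schedule"
  case replan sched of
    Nothing       -> pure False
    Just newSched -> putIfUnchanged etcd "/ozymandias/schedule" revision newSched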

If there is a network partition in the etcd cluster, then some schedulers will be unable to refresh their leases or see state changes; the rest of the cluster will treat them as if they had gone offline, and re-allocate their instances once their leases expire.

Ingress

Each node will run an HTTP ingress process. When it receives a request, it checks the Host header and proxies the request to the appropriate job.
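
For the HTTP case, a minimal sketch of the ingress process using wai, warp, and http-reverse-proxy, with a hard-coded routing table standing in for one derived from the schedule in etcd:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import Network.HTTP.Client (defaultManagerSettings, newManager)
import Network.HTTP.ReverseProxy
import Network.HTTP.Types (status404)
import Network.Wai (Request, requestHeaderHost, responseLBS)
import Network.Wai.Handler.Warp (run)

-- Hard-coded for illustration; the real table would track the schedule.
routes :: [(BS.ByteString, ProxyDest)]
routes = [("webapp.example.com", ProxyDest "10.88.0.4" 3000)]

-- Pick a backend from the Host header, or 404 if it's unknown.
pickDestination :: Request -> IO WaiProxyResponse
pickDestination req = pure $
  case requestHeaderHost req >>= \host -> lookup host routes of
    Just dest -> WPRProxyDest dest
    Nothing   -> WPRResponse (responseLBS status404 [] "no such host")

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  run 80 (waiProxyTo pickDestination defaultOnExc manager)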

I’ve not really thought about non-HTTP ingress yet, but I suspect I’d have to do something like run one ingress process per port, each of which proxies connections based on the port rather than a header. I’d like ingress to work globally, so you don’t need to know which node a process is running on.

Next steps?

Just implementing the scheduler and ingress process would be a big task, but there are always future improvements to think about.

Software Engineering

Roleplaying Games

Miscellaneous