Rebuilding Cricalix.Net – Part 2

LXD’s Documentation

It’s mostly decent. It’s got a lot of detail on what all of the configuration sections are and generally provides examples. What I find missing is a set of practical documentation that guides someone through getting started with LXD – weaving together all of the configuration for devices, proxies, storage volumes, profiles and so on. I’ll probably get around to offering up some documentation in that vein once I’ve finished this rebuild; I’ve already contributed updates to the LXD documentation on cloud-init, because I found it a bit vague. It basically said “we support cloud-init, here’s how to configure the network of the instance, otherwise see the cloud-init examples”, but didn’t indicate how things like the user-data section worked.

There’s also a lack of “here’s when you should consider using this approach” documentation. One way of doing port forwarding in LXD is to use proxies, the documentation for which states

Proxy devices allow forwarding network connections between host and instance. This makes it possible to forward traffic hitting one of the host’s addresses to an address inside the instance or to do the reverse and have an address in the instance connect through the host.

https://linuxcontainers.org/lxd/docs/master/instances/?highlight=proxy#type-proxy

However, there are also network forwards, the documentation for which states

Network forwards allow an external IP address (or specific ports on it) to be forwarded to an internal IP address (or specific ports on it) in the network that the forward belongs to.

https://linuxcontainers.org/lxd/docs/master/howto/network_forwards/

So, two different ways to accomplish the same thing. The subtlety seems to be that a proxy device can be assigned in a profile, while network forwards are configured against the network itself with CLI incantations. Both require knowing the IP address of the container, which is a bit of an issue if you’re not using stateful DHCP (or static addressing via cloud-init.network-config), but the proxy approach at least allows use of the container instance’s loopback address, in contrast to the network forward approach, which appears to push traffic to the container’s address on the bridge. A proxy device definition looks like this:

  proxy_587:
    connect: tcp:127.0.0.1:587
    listen: tcp:0.0.0.0:587
    type: proxy
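
For comparison, the same port 587 forward expressed as a network forward is a couple of CLI incantations rather than profile configuration. A rough sketch, assuming the bridge is lxdbr0, the host’s external address is 1.2.3.4, and the container’s bridge address is 10.68.0.73 (both addresses are placeholders):

lxc network forward create lxdbr0 1.2.3.4
lxc network forward port add lxdbr0 1.2.3.4 tcp 587 10.68.0.73 587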

The main drawback to using the proxy configuration is that the webserver (for instance) sees connections coming from the loopback address, so you lose any ability to work with the source IP. At least, when I tried deploying this in production, the webserver I ended up using only saw the loopback address for IPv4 requests (which makes sense if it’s a proxy).

It is possible to configure an LXD proxy in NAT mode (nat: true), but there are some criteria for that to work:

  1. Assuming a bridge network, it must be managed (default state for LXD)
  2. IPv6 must be configured as stateful (ipv6.dhcp.stateful: true for the network)
  3. Static addressing of the container is set via LXD (the ipv4.address and ipv6.address keys on the container’s NIC device, as shown below)
  4. The connect and listen statements have to refer to the static IP of the container and host respectively
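
The network-level criteria can be checked and set with the LXD CLI; a minimal sketch, assuming the managed bridge is the default lxdbr0:

# Confirm the bridge is managed and inspect its current configuration
lxc network show lxdbr0
# Enable stateful DHCPv6 on the bridge
lxc network set lxdbr0 ipv6.dhcp.stateful true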

The configuration looks something like this:

devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
    # Assumes lxdbr0 is configured with 10.68.0.1/24
    ipv4.address: 10.68.0.73
    # Taken from the lxdbr0 configuration after first-time DHCP
    ipv6.address: 2a01:f12:c012:7dfb:c1d1:5420:9291:bf24
  proxy_80:
    # IPv4 from eth0 config
    connect: tcp:10.68.0.73:80
    # IP address of host machine
    listen: tcp:1.2.3.4:80
    nat: "true"
    type: proxy

Infrastructure as Code?

In the world of operations (or DevOps, or sysadmin, take your pick), Infrastructure as Code (IaC) has become the de facto way of doing things if you have more than a few containers or bare-metal servers. You express your deployment configuration in a DSL of some sort, and a program interprets that configuration into instructions that configure a machine. There are a number of players in this space, including CFEngine, Puppet, Ansible, and Chef. Then there are orchestration layers like Kubernetes and Terraform for organising the inventory of nodes and so on.

I took a look at the documentation for Puppet (which I last used over a decade ago), Ansible, and Chef. In every case, they seemed like utter overkill for my tiny group of containers, and introducing a control plane via that kind of tooling feels like a lot of work for not much gain.

Instead, I’m using LXD’s cloud-init support, a couple of shell scripts, and a private Git repository on GitHub. This is going to be tailored to my needs, and will probably be a little janky and rough, but it’ll do what I want it to do without having to stand up infrastructure for the IaC tooling.

I’m accepting a very specific risk with this approach, and that’s storing my secrets in a GitHub private repository. Secrets like the credentials for my Let’s Encrypt account, the token that can access that GitHub repository, and the credentials for the SMTP relay that I use. I know this is not a great way to store secrets, but if I’m not running somewhere like the major commercial clouds (i.e., AWS, Azure, GCP), I don’t have magic credential stores available to me. I am absolutely not advocating that anyone else do things this way.

Using cloud-init for initial configuration

In the absence of a full IaC setup, LXD’s support for cloud-init is something that can reduce the amount of manual work when setting up containers repeatedly (like during testing). With the addition of the private git repository, and a script that can read files from the repository and set user/group/permissions on files, it’s possible to get some automation in place.
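
As a rough sketch of the shape this takes (not my actual profile; the package list, paths, and script name are illustrative), a profile carries a cloud-init.user-data block that installs packages, writes files, and runs first-boot commands:

config:
  cloud-init.user-data: |
    #cloud-config
    package_update: true
    packages:
      - git
    write_files:
      # In reality the content is the full git-setup.sh, base64 encoded;
      # this placeholder decodes to just a shebang line.
      - path: /usr/local/sbin/git-setup.sh
        permissions: "0750"
        encoding: b64
        content: IyEvYmluL2Jhc2gK
    runcmd:
      - [ /usr/local/sbin/git-setup.sh ]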

YAML config + perl + sed + yq + plantuml = visualisation of the configuration

That diagram comes from an in-progress profile for the mx container; PlantUML has support for drawing diagrams from YAML, though it does not support the YAML literal style (hence the pipeline that includes perl, sed, and yq). The profile uses a YAML literal for the cloud-init.user-data segment, and that segment uses the literal style again for file content. The pipeline also truncates the commands in runcmd for simplicity, hence the repeated systemctl entries; I could probably replace them with more shell scripts. If you look at the source of the SVG, you can see the sanitised YAML data.

#!/bin/bash
# Render an LXD profile YAML as a PlantUML diagram.
FPATH=$1
BASE=$(basename "${FPATH/.yaml/}")
if [ ! -f "${FPATH}" ]; then
        echo "Can't find ${FPATH}"
        exit 99
fi

# Truncate runcmd-style list entries to their first element, strip the
# YAML literal indicators (|) that PlantUML can't parse, replace the
# base64 write_files content with a short placeholder, wrap the result
# in PlantUML's @startyaml/@endyaml markers, and render to SVG.
perl -pe 's/- \[\s+([\w\/\.-]+),.*\]/- [ $1 ]/' < "${FPATH}" |\
sed -e 's/|//' |\
yq '(.config."cloud-init.user-data".write_files[].content) |= "IyEvYmluL2Jhc2gK"' |\
sed -e '1i @startyaml' -e '$a @endyaml' |\
java -jar ~/bin/plantuml-1.2022.6.jar -tsvg -pipe > "${BASE}.svg"
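
Usage is just the profile YAML as the only argument; assuming the script is saved as visualise.sh and the profile lives in mx.yaml (both names hypothetical):

./visualise.sh mx.yaml   # writes mx.svg to the current directory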

Because this isn’t a full IaC setup, there’s no concept of test versus production deployments, other than using different YAML configs and piping them to lxc profile edit <thing>. That’s a bit of a pain, because the test and prod configs can drift without diligent attention to changes being made. However, for my purposes, it’s fine.
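
Deployment is then just piping the relevant file into the relevant profile; something like this, with hypothetical profile and file names:

# Test deployment
lxc profile edit mx-test < mx-test.yaml
# Production deployment
lxc profile edit mx < mx.yaml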

A drawback to this approach is that a failing shell script only results in a log entry; there’s no other sign that the setup is broken. An orchestration layer on top would probably have monitoring for this, but I don’t have an orchestration layer (and I’m not building one).

Using git as configuration management

Earlier, I mentioned that I have a private GitHub repository for various configuration files that need to be deployed to the containers. The git-setup.sh script in the config above is sent as part of the cloud-init.user-data manifest (base64 encoded), and all other files needed by the container are stored in the repository. The repository has a simple tree structure – directories named after the container name (mx, mail, vpn etcetera), and then files are laid out as they appear in the container. If I need to have specific ownership or permissions set, three dotfiles can be used.
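
As an illustration (the file names here are made up, and I’m assuming git-copier.sh lives in the support directory), the tree for the mx container might contain:

mx/etc/postfix/main.cf
mx/etc/postfix/.owner           # ownership for the postfix/ directory itself
mx/etc/postfix/.main.cf.perms   # permissions for main.cf specifically
support/git-copier.sh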

On the first boot of the container, the git-setup.sh script is run; this clones the private repository in sparse mode, then checks out two portions of the repository – a support directory, and the directory named for the container. The git-copier.sh script is then copied over to its permanent home, and made executable. The next runcmd entry runs that script, which (see the sketch after this list):

  • ensures directories are created first
  • checks for .owner, .group, and .perms files in the repository copy of the directory, applying them via chown, chgrp and chmod
  • scans for all non-dotfiles in the source tree, copying them to their final destination
  • does the same check for owner/group/permissions, but looking for .<filename>.(owner|group|perms)
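
Neither script appears in this post, but here is a minimal sketch of the pair, under the same assumptions as above about the repository layout and dotfile names (the repository URL and paths are placeholders):

#!/bin/bash
# git-setup.sh - first-boot bootstrap, run from cloud-init's runcmd
set -e
REPO="git@github.com:example/config.git"
CLONE=/opt/config
CONTAINER=$(hostname)   # assumes the container name matches its hostname

# Sparse clone: only materialise the support directory and this container's tree
git clone --filter=blob:none --sparse "${REPO}" "${CLONE}"
git -C "${CLONE}" sparse-checkout set support "${CONTAINER}"

# Put the copier in its permanent home and make it executable
install -m 0755 "${CLONE}/support/git-copier.sh" /usr/local/sbin/git-copier.sh

And a matching sketch of the copier:

#!/bin/bash
# git-copier.sh - copy the container's tree into place, applying ownership
# and permissions from .owner/.group/.perms and .<file>.(owner|group|perms)
set -e
SRC=/opt/config/$(hostname)

apply_meta() {
    # $1 = directory holding the dotfiles, $2 = dotfile prefix, $3 = target path
    local dir=$1 prefix=$2 target=$3
    [ -f "${dir}/${prefix}owner" ] && chown "$(cat "${dir}/${prefix}owner")" "${target}"
    [ -f "${dir}/${prefix}group" ] && chgrp "$(cat "${dir}/${prefix}group")" "${target}"
    [ -f "${dir}/${prefix}perms" ] && chmod "$(cat "${dir}/${prefix}perms")" "${target}"
    return 0
}

# Directories first, applying any directory-level dotfiles
find "${SRC}" -mindepth 1 -type d | while read -r dir; do
    rel=${dir#"${SRC}"/}
    mkdir -p "/${rel}"
    apply_meta "${dir}" "." "/${rel}"
done

# Then the files themselves (skipping the dotfiles), with per-file dotfiles
find "${SRC}" -type f ! -name '.*' | while read -r file; do
    rel=${file#"${SRC}"/}
    cp "${file}" "/${rel}"
    apply_meta "$(dirname "${file}")" ".$(basename "${file}")." "/${rel}"
done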

Is it a dirty hack? Yes. Does it look like a poor clone of an IaC tool? Certainly. Is it easier than spinning up a full IaC setup? Absolutely. Did I nerd-snipe myself with this? Yep.

If I make local changes in a container to tune the configuration, it’s on me to remember to update the git repository so that future container deployments will have that tuning.

The runcmd section only runs on the first boot, so I don’t have to worry about it doing things like resetting databases.