Skyforge

We ain't no architects. Ain't got the intellect.

An Introduction to Jails and Jail Networking

2018-03-14

Generalities on Jails

Jails basically partition a FreeBSD system into various isolated sub-systems called jails. The syscall and userspace tools first appeared in FreeBSD 4.0 with subsequent releases expanding functionality and improving existing features as well as usability.

For many users, modern jails will feel similar to LXC on GNU/Linux and are, just like LXC, used for resource and process isolation. Unlike LXC, however, jails are a first-class concept and are well integrated into the base system. In essence, both offer a chroot-with-extra-separation feeling.

Since jails have been around quite a bit longer than most of their modern-day equivalents, they are fairly well tested and are - as we will see - surprisingly easy to use in all recent releases. However, the many changes have also introduced their fair share of quirky behaviour, some of which might be considered downright weird or at least slightly misleading. These oddities are usually a source of great pain for anybody starting out on the subject. This small post aims to address the most common problems and quirks and to provide solutions or workarounds whenever possible.

Setting up modern Jails: An Example

Setting up a jail is a fairly simple process, which can essentially be split into three steps:

Place the stuff you want to run and the stuff it needs to run somewhere on your filesystem.
Add some basic configuration for the jail in jail.conf.
Fire up the jail.

The typical scenario for the first step is probably a complete base system, and for FreeBSD that’s extremely easy to do. All you need is a release tarball and a place to put it, which nowadays is usually a ZFS dataset, but any directory on your filesystem would do just as well.

~ # zfs create -p zroot/srv/jails/bsd-test01
~ # zfs set mountpoint=/srv zroot/srv
~ # fetch -o base-11.1-RELEASE.txz ftp://ftp.freebsd.org/pub/FreeBSD/releases/amd64/amd64/11.1-RELEASE/base.txz
~ # tar xf base-11.1-RELEASE.txz -C /srv/jails/bsd-test01

This creates the ZFS datasets zroot/srv, zroot/srv/jails and zroot/srv/jails/bsd-test01, mounts them at /srv, /srv/jails and /srv/jails/bsd-test01, and extracts an 11.1 release tarball to /srv/jails/bsd-test01.
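Before moving on, it can be worth checking that the extraction really produced something that looks like a base system. The following is a minimal sketch in POSIX sh; the function name looks_like_base_system and the choice of files to check (the shell and the two rc scripts a jail typically boots with) are assumptions made for this example, not part of any FreeBSD tool.

```shell
# Hypothetical helper: succeed if DIR contains the few files a jail
# needs in order to boot with the standard rc scripts.
looks_like_base_system() {
    dir=$1
    for f in bin/sh etc/rc etc/rc.shutdown; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    return 0
}
```

Calling looks_like_base_system /srv/jails/bsd-test01 right after the tar step catches a wrong -C path or a truncated download early.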

Next, we need to tell the various jail-related tools a few things about our soon-to-be jail. This can be done by placing a few, almost self-explanatory settings in /etc/jail.conf:

bsd-test01 {
        exec.start  = "/bin/sh /etc/rc";
        exec.stop   = "/bin/sh /etc/rc.shutdown";
        exec.clean;

        mount.devfs;

        path = "/srv/jails/bsd-test01";

        host.hostname = "bsd-test01.local";

}

All of these parameters are documented in the jail(8) manpage, but here’s a quick rundown of the options used above:

exec.start: The command (relative to the path of the jail) that should be executed when the jail is started.
exec.stop: The command (relative to the path of the jail) that should be executed when the jail is stopped.
exec.clean: This keeps your shell’s environment from being dragged into the jail, similarly to su vs. su -.
mount.devfs: Mounts a (restricted) devfs at /dev inside the jail.
path: The directory where the jail environment has been placed.
host.hostname: The hostname that will be preset inside the jail.
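If you create jails often, the entry above is easy to generate. Here is a small sketch in POSIX sh that prints such an entry for a given jail name; the function name render_jail_entry and the default root of /srv/jails/<name> are assumptions taken from this post's layout, not something jail(8) provides.

```shell
# Hypothetical helper: print a jail.conf entry for the jail named $1,
# rooted at $2 (defaults to /srv/jails/<name>).
render_jail_entry() {
    name=$1
    root=${2:-/srv/jails/$name}
    # Unquoted heredoc: $name and $root are expanded by the shell.
    cat <<EOF
$name {
        exec.start  = "/bin/sh /etc/rc";
        exec.stop   = "/bin/sh /etc/rc.shutdown";
        exec.clean;

        mount.devfs;

        path = "$root";

        host.hostname = "$name.local";
}
EOF
}
```

The output can be reviewed first and then appended to the configuration, for example with render_jail_entry bsd-test02 >> /etc/jail.conf.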

Once we’ve configured the jail in jail.conf we can use the aptly named jail utility to launch the jail:

~ # jail -c bsd-test01

To confirm that the jail started successfully we can use the jls utility:

~ # jls
JID  IP Address      Hostname            Path
  4                  bsd-test01.local    /srv/jails/bsd-test01
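In scripts, the same check can be done without parsing the table. A minimal sketch, assuming that jls -j exits with a non-zero status when no jail matches; the function name jail_is_running is made up for this example.

```shell
# Hypothetical helper: succeed if a jail with the given name (or JID)
# is currently running. Only the exit status of jls is used; its
# output (the jail's jid) is discarded.
jail_is_running() {
    jls -j "$1" jid >/dev/null 2>&1
}
```

This reads naturally in conditions such as: jail_is_running bsd-test01 || jail -c bsd-test01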

We can now enter the jailed environment by using jexec, which will by default execute a root shell inside the named jail:

~ # jexec bsd-test01
root@bsd-test01:/ #

If you’ve used LXC then this should all look awfully familiar to you. It’s pretty much the same stuff you’d do with LXC except that creating the jail requires no template because installing FreeBSD is just unpacking an archive.

The details: jail.conf

The general syntax of an entry for a jail in jail.conf is described in the jail.conf(5) manpage, but it generally looks like this:

jailname {
   parameter = "value";
   parameter = "value";
   ...
}

The available options can be found in the jail(8) manpage. Here’s a summary of the most important things to keep in mind:

Comments start with #, // or /* ...... */ (multiline).
Options outside a jail-block are set globally for all jails below that option (the file is read top-down).
Auxiliary variables may be defined with $var = "...".
Variables and options may be substituted with ${option} or simply $option.
Some variables permit a list of values (e.g. IPs). Once a value has been assigned to such a variable, others can be added using +=.

This syntax allows for a clear and concise rewrite of our example configuration:

exec.start  = "/bin/sh /etc/rc";
exec.stop   = "/bin/sh /etc/rc.shutdown";
exec.clean;
mount.devfs;

path = "/srv/jails/${name}";
host.hostname = "${name}.local";

bsd-test01 {
}
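With the shared settings made global, adding a jail boils down to adding an empty block. As a sketch of how that scales, the following POSIX sh function prints the global defaults once, followed by one empty block per jail name; render_jail_conf is a hypothetical name, and the settings are simply the ones from the example above.

```shell
# Hypothetical helper: print a jail.conf that keeps all shared settings
# global and adds one empty block per jail name given as an argument.
render_jail_conf() {
    # Quoted delimiter: ${name} must reach jail.conf unexpanded, so that
    # jail(8) substitutes it separately for each jail.
    cat <<'EOF'
exec.start  = "/bin/sh /etc/rc";
exec.stop   = "/bin/sh /etc/rc.shutdown";
exec.clean;
mount.devfs;

path = "/srv/jails/${name}";
host.hostname = "${name}.local";
EOF
    for jail in "$@"; do
        printf '\n%s {\n}\n' "$jail"
    done
}
```

For example, render_jail_conf bsd-test01 bsd-test02 prints a configuration for two jails that differ only in their names.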

Expanding on this, it is usually best practice to set the securelevel of a jailed environment to the highest possible value (3) via the securelevel parameter, even though most operations affected by the securelevel are already prohibited inside a jail by default.

Other common parameters include:

allow.*: These parameters lift certain restrictions (e.g. sysvipc, mounting…).
ip4.addr and ip6.addr: These contain a list of IP addresses to add to the jail (more on this down below).

Caution: Allowing certain actions may have unwanted side effects. This is particularly true for the allow.sysvipc option. Sysvipc inside the jail is not namespaced, so different jails may affect each other. This is a common occurrence with some programs, most notably PostgreSQL.

Update 2018-03-26: allow.sysvipc has been deprecated and its alternatives work much nicer. Here’s a post with more information on this subject. Thanks to Harald Eilertsen (@harald@quitter.no) for the heads up.

Options may be modified at runtime using the jail command:

jail -m securelevel=3 name=bsd-test01

In general, any configuration done via jail.conf can also be passed directly to the jail command, so you can create an ad-hoc jail without ever touching the configuration file by adding all parameters to the jail -c command.
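As an illustration, the example jail from above could be started ad hoc roughly like this. This is a sketch, not a canonical invocation: the wrapper name start_adhoc_jail and the DRY_RUN switch are inventions for this example, while the parameters are the same ones used in jail.conf earlier, written in the param=value form that jail -c accepts.

```shell
# Hypothetical wrapper: start a jail without touching jail.conf.
# With DRY_RUN set, the command is printed (one argument per line)
# instead of being executed.
start_adhoc_jail() {
    name=$1
    set -- jail -c "name=$name" \
        "path=/srv/jails/$name" \
        "host.hostname=$name.local" \
        mount.devfs exec.clean \
        "exec.start=/bin/sh /etc/rc" \
        "exec.stop=/bin/sh /etc/rc.shutdown"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$@"
    else
        "$@"
    fi
}
```

Running DRY_RUN=1 start_adhoc_jail bsd-test01 shows the arguments without creating anything, which is handy for checking the quoting of values that contain spaces.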

Jails and Networking

Networking inside of jails is actually an oddly simple thing at first glance. One can simply assign addresses to jails via the ip4.addr and ip6.addr parameters. The address will be created on the host and “patched” into the jail.

Example:

bsd-test01 {
	ip4.addr = "em0|10.10.0.1/32";
}

This allows the jail to use the address, but restricts everything else: A jail cannot add, modify or delete addresses or routes via ifconfig or other tools.

The Loopback Problem:

A jail can only see and use addresses that have been passed down to it by the parent system. This creates a slight problem with the loopback address: The host would probably like to keep that address to itself and not share it with any jail.

Because of this, the loopback-address inside a jail is emulated by the system:

127.0.0.1 is an alias for the first IPv4-address assigned to the jail.
::1 is an alias for the first IPv6-address assigned to the jail.

While this looks simple enough and usually works just fine[tm], it is also a source of many problems. Just imagine that your jail has only a single global IPv4 address assigned to it. A daemon binding its (possibly unsecured) control port to the loopback address would then unintentionally be exposed to the rest of the internet, which is hardly ever a good idea.

This is most commonly addressed by first creating a lo1-device and then assigning each jail a loopback address of the form 127.0.1.$jailid on lo1:

In /etc/rc.conf:

cloned_interfaces="lo1"

In /etc/jail.conf:

ip4.addr =  "lo1|127.0.1.$jailid/32";
ip4.addr += "em0|....";
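Here $jailid is treated as a placeholder that you fill in per jail. If the addresses are generated by a script, a small helper can derive them and reject ids that do not fit into the last octet. A sketch in POSIX sh; the function name jail_loopback_addr and the accepted range of 1 to 254 are assumptions made for this example.

```shell
# Hypothetical helper: print the lo1 loopback address for a jail id,
# following the 127.0.1.<id> scheme used in this post.
jail_loopback_addr() {
    id=$1
    case $id in
        ''|*[!0-9]*)
            echo "not a number: $id" >&2
            return 1
            ;;
    esac
    if [ "$id" -lt 1 ] || [ "$id" -gt 254 ]; then
        echo "out of range (1-254): $id" >&2
        return 1
    fi
    echo "127.0.1.$id"
}
```

For instance, jail_loopback_addr 3 yields the address to put after "lo1|" for the jail with id 3.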

Inside the jail, 127.0.0.1 would now be an alias for 127.0.1.$jailid, which would prevent the problem outlined above… except if we have more than one jail, since different jails can by default access each other’s loopback addresses.

To fix this, one has to do the one thing that one usually avoids doing: Filtering loopback traffic. In pf (and all other packet filters), this can be done quite easily:

pass quick from (lo0) to (lo0)

pass quick from 127.0.1.3 to 127.0.1.3
pass quick from 127.0.1.4 to 127.0.1.4
pass quick from 127.0.1.5 to 127.0.1.5
....

Note: One might think that one could get away with skipping filtering on lo0, but loopback traffic, even that on lo1, is strangely entangled with lo0, as some auxiliary tcpdumping suggests.
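Writing one rule per jail by hand gets tedious quickly, so the list can be generated. The following POSIX sh sketch prints the same rules as above for a list of jail ids; the function name pf_jail_loopback_rules is made up, and the rules only have the intended effect in a ruleset that blocks everything else by default, which is assumed here.

```shell
# Hypothetical helper: print pf rules that let the host talk to itself
# on lo0 and let each jail reach only its own 127.0.1.<id> address.
# Assumes a default-deny ruleset around these rules.
pf_jail_loopback_rules() {
    echo "pass quick from (lo0) to (lo0)"
    echo
    for id in "$@"; do
        echo "pass quick from 127.0.1.$id to 127.0.1.$id"
    done
}
```

The output of pf_jail_loopback_rules 3 4 5 can be written to a separate file and pulled into pf.conf with an include statement.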

IPv6 and the loopback Problem:

Finally, we still need to address the same problem for IPv6. However, there’s only one single loopback address in IPv6. Since IPv6 protocol designers seem to enjoy a certain degree of naivety, we also no longer have site-scope addresses that could fill this gap… but we do have link-local addresses. And lo1 is its own link.

Sadly, simply assigning an fe80::-address to each jail isn’t going to work as easily as one might expect. Assigning Link-Local addresses to jails is somewhat broken in FreeBSD, but it can be done (see [2])!

We first need the scope identifier of the interface (in this case lo1), which can (somewhat ironically) only be seen after assigning an IPv6-address:

ifconfig lo1 inet6 fe80::dead:beef:1 add

Afterwards, ifconfig lo1 should reveal the scope identifier:

~ # ifconfig lo1
lo1: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        [...]
        inet6 fe80::dead:beef:1%lo1 prefixlen 64 scopeid 0x3
        [...]

Remark: The scope-id depends on the order in which the kernel creates the interfaces. This order may change if you suddenly add new devices or if a change in the kernel configuration adds pseudodevices such as enc0 (ipsec) or pflog0 (pf).

Once we have the scope-id we can assign the link-local address to the jail in jail.conf:

ip6.addr =  "lo1|fe80:$scopeid::dead:beef:$jailid/64";
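Since the scope-id can change between boots, it may be safer to read it from ifconfig than to hard-code it. A sketch in POSIX sh, assuming the ifconfig output format shown above; the function names lo_scope_id and jail_ip6_value are hypothetical. Note that the jail id ends up in a hexadecimal address group, so an id of 10 is read as 0x10; the address stays unique either way.

```shell
# Hypothetical helper: read ifconfig output on stdin and print the first
# scope id found, as hex digits without the 0x prefix.
lo_scope_id() {
    sed -n 's/.*scopeid 0x\([0-9a-fA-F]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print the ip6.addr value for a jail, given the
# scope id ($1) and the jail id ($2), in the form used above.
jail_ip6_value() {
    echo "lo1|fe80:$1::dead:beef:$2/64"
}
```

Usage could look like this: scope=$(ifconfig lo1 | lo_scope_id), followed by jail_ip6_value "$scope" 3 to get the string for the jail with id 3.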

Again, filtering needs to be done in pf or ipfw to prevent cross-jail communication.

A note on mixing internal and external IPv4 addresses:

The number of available IPv4-addresses on a system is often fairly limited, so most service containers/jails hide behind a few global front ends (such as reverse proxies). These front ends usually have both global and local addresses, to accept global requests and pass them along to the local service jails.

In jails, source address selection is fairly broken (see [1]) and you may want to avoid the pitfalls by assigning only local addresses to jails. The global addresses can be created on the host and passed through via 1:1 NAT in pf or ipfw.

Example:

If em0 is your system’s external interface, then you can add another IP to it via the following line in /etc/rc.conf:

ifconfig_em0_alias0="inet 1.2.3.4/24 netmask 0xffffff00"

In pf, passing traffic for ports 80 and 443 to a jail with local address 10.0.0.5 could read like:

ext_if="em0"
ext_web_ip="1.2.3.4"
int_web_ip="10.0.0.5"

rdr pass on $ext_if proto tcp from any to $ext_web_ip port {http,https} -> $int_web_ip

Thin jails:

Thin jails are a result of the base-system/package split that FreeBSD employs. A running system usually modifies very few parts of the base system. Splitting those parts from the rest of the base allows one to have one read-only base system mounted inside several jails without any ill effects. This makes creating, maintaining and upgrading a large number of jails much easier.

To start off, we need to split the base system into the read-only parts (the “basejail”) and the parts that should be modifiable by a jail (the “template”).

Let’s create two directories/ZFS datasets and extract another base image:

zfs create -p zroot/srv/jails/basejail/default-11.1
zfs create -p zroot/srv/jails/templates/default-11.1

tar xf base-11.1-RELEASE.txz -C /srv/jails/basejail/default-11.1/
cd /srv/jails/basejail/default-11.1

Every directory that the jail has to write/modify needs to be moved to the template:

chflags -R noschg var/empty
mv etc dev media mnt proc root tmp var /srv/jails/templates/default-11.1

mkdir /srv/jails/templates/default-11.1/usr

mv usr/local usr/obj /srv/jails/templates/default-11.1/usr

mv .cshrc .profile COPYRIGHT /srv/jails/templates/default-11.1

Next, we need to switch to the template and create links for all the things that remain in the base image and will later be added to the jail via a nullfs mount. The loops below are written in csh syntax:

cd /srv/jails/templates/default-11.1

mkdir basejail

foreach n ( bin boot lib libexec rescue sbin sys )
	ln -s /basejail/$n $n
end

foreach n (bin games include lib lib32 libdata libexec ports sbin share src)
	ln -s /basejail/usr/$n usr/$n
end
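The foreach loops above only work in csh/tcsh. For scripts, or if your shell is sh, the same links can be created as follows. This is a sketch of an equivalent in POSIX sh; the function name link_basejail is made up, and it takes the template directory as an argument instead of relying on the current directory.

```shell
# Hypothetical helper: create the basejail mount point and the symlinks
# into /basejail inside the template directory given as $1.
link_basejail() {
    template=$1
    mkdir -p "$template/basejail" "$template/usr" || return 1
    for n in bin boot lib libexec rescue sbin sys; do
        ln -s "/basejail/$n" "$template/$n" || return 1
    done
    for n in bin games include lib lib32 libdata libexec ports sbin share src; do
        ln -s "/basejail/usr/$n" "$template/usr/$n" || return 1
    done
}
```

It would be called as link_basejail /srv/jails/templates/default-11.1. The links dangle on the host and only resolve inside a jail, once the basejail has been mounted at /basejail.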

At this point, we can modify the base template to our liking to have good defaults for every new system:

/etc/resolv.conf
/etc/make.conf
/etc/periodic.conf
/etc/ssh/sshd_config
/root/.ssh/authorized_keys
/etc/login.conf

Remark: It is also possible to preinstall certain packages to create a template that has e.g. all the packages required to immediately run ansible on a newly started jail.

To create a thin jail, one simply copies the template to a new dataset and tells jail.conf to mount the base system read-only at the /basejail subdirectory:

zfs create -p zroot/srv/jails/bsd-test02
cp -a /srv/jails/templates/default-11.1/ /srv/jails/bsd-test02/

In /etc/jail.conf:

mount = "/srv/jails/basejail/default-11.1 /srv/jails/$name/basejail nullfs ro 0 0";
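Put together, creating a thin jail could be scripted roughly as follows. This is a sketch under this post's layout (zroot/srv/jails, templates and basejails named default-11.1); the names run and create_thin_jail as well as the DRY_RUN switch are inventions for this example.

```shell
# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: create a thin jail named $1 from template $2
# (default: default-11.1) and print the matching mount line.
create_thin_jail() {
    name=$1
    template=${2:-default-11.1}
    run zfs create -p "zroot/srv/jails/$name" || return 1
    run cp -a "/srv/jails/templates/$template/" "/srv/jails/$name/" || return 1
    # This line still has to be added to the jail's block in /etc/jail.conf.
    echo "mount = \"/srv/jails/basejail/$template /srv/jails/$name/basejail nullfs ro 0 0\";"
}
```

With DRY_RUN=1 create_thin_jail bsd-test02 nothing is changed and the two commands plus the mount line are printed for review.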

Upgrading thin jails:

Minor upgrades to a basejail can easily be done via freebsd-update:

freebsd-update -b /path/to/basejail fetch
freebsd-update -b /path/to/basejail install

If the version of the FreeBSD system inside the basejail is different to the one on the host (say you’re running a 10.3 basejail on an 11.1 host), you can use the UNAME_r environment variable to specify the release version:

env UNAME_r=10.3-RELEASE freebsd-update -b /path/to/basejail fetch
env UNAME_r=10.3-RELEASE freebsd-update -b /path/to/basejail install

Upgrades between major releases are best handled differently: Simply create a new basejail for the new system version and mount it into the existing jail in place of the old one:

Stop the jail.
Change the mount = ... option in jail.conf from the old to the new basejail.

Upgrade the existing system configuration files in the jails using mergemaster:

mergemaster -F -t /path/to/actual/jail/var/tmp/temproot -D /path/to/actual/jail

Start the jail.
Upgrade the packages if necessary (pkg-static install -f pkg, pkg upgrade).

Note: Upgrading the base system often involves changes to files like passwd or group. It is advisable to regenerate the databases associated with these files (mergemaster can actually run those commands at the end of an upgrade), which can be done inside the jail using the following commands:

cap_mkdb /etc/login.conf
services_mkdb -q -o /var/db/services.db /etc/services
pwd_mkdb -d /etc -p /etc/master.passwd

Final remarks:

Jails are also integrated into quite a number of programs, both within the base system and in some packages available in ports, e.g.

pkg -j [name]
ps -J [name] -aux
htop and top

Feedback:

This post is the result of a talk at the monthly BSD Stammtisch. Feedback is always welcome. If you have any questions, remarks or suggestions feel free to write an email!

References:

[1] Bug 168678 - jail raw sockets incorrectly choose source address when jail has multiple subnets

[2] Bug 206012 - jail(8): Cannot assign link-local IPv6 address to a jail

[3] FreeBSD Jails the hard way

A Note on SYSVIPC and Jails on FreeBSD