Purpose of second partition in /dev/xvdc?

I’m looking at the simple init script that may be run on boot in a VM (the relevant alternative being the full_cow_setup script, but the differences between the two are superficial apart from the udev setup that happens at the very end of qubes_cow_setup.sh and does not appear in simple.sh). In particular, this snippet is confusing when I compare it against the actual state of my running VMs:

if [ `cat /sys/class/block/$ROOT_DEV/ro` = 1 ] ; then
    echo "Qubes: Doing COW setup for AppVM..."

    while ! [ -e /dev/xvdc ]; do sleep 0.1; done
    VOLATILE_SIZE_512B=$(cat /sys/class/block/xvdc/size)
    if [ $VOLATILE_SIZE_512B -lt $SWAP_SIZE_512B ]; then
        die "volatile.img smaller than $SWAP_SIZE_GiB GiB, cannot continue"
    fi
    /sbin/sfdisk -q /dev/xvdc >/dev/null <<EOF
xvdc1: type=82,start=1MiB,size=${SWAP_SIZE_GiB}GiB
xvdc2: type=83
EOF
    if [ $? -ne 0 ]; then
        echo "Qubes: failed to setup partitions on volatile device"
        exit 1
    fi
    while ! [ -e /dev/xvdc1 ]; do sleep 0.1; done
    /sbin/mkswap /dev/xvdc1
    while ! [ -e /dev/xvdc2 ]; do sleep 0.1; done

    echo "0 `cat /sys/class/block/$ROOT_DEV/size` snapshot /dev/$ROOT_DEV /dev/xvdc2 N 16" | \
        /sbin/dmsetup create dmroot || { echo "Qubes: FATAL: cannot create dmroot!"; exit 1; }
    /sbin/dmsetup mknodes dmroot
    echo Qubes: done.
else
    echo "Qubes: Doing R/W setup for TemplateVM..."
    while ! [ -e /dev/xvdc ]; do sleep 0.1; done
    /sbin/sfdisk -q /dev/xvdc >/dev/null <<EOF
xvdc1: type=82,start=1MiB,size=${SWAP_SIZE_GiB}GiB
xvdc3: type=83
EOF
    if [ $? -ne 0 ]; then
        die "Qubes: failed to setup partitions on volatile device"
    fi
    while ! [ -e /dev/xvdc1 ]; do sleep 0.1; done
    /sbin/mkswap /dev/xvdc1
    ln -s ../$ROOT_DEV /dev/mapper/dmroot
    echo Qubes: done.
fi
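
Since I had not used dmsetup before, here is my annotation of the snapshot line in the AppVM branch, based on my reading of the dm-snapshot target documentation (so this is my gloss, not anything taken from the script itself):

# 0                                       start sector of the mapping
# $(cat /sys/class/block/$ROOT_DEV/size)  length in 512-byte sectors (the whole root volume)
# snapshot                                device-mapper target type
# /dev/$ROOT_DEV                          read-only origin (the template’s root image)
# /dev/xvdc2                              COW device that absorbs all writes
# N                                       non-persistent: the snapshot state is discarded at shutdown
# 16                                      chunk size in 512-byte sectors (8 KiB)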

Based on the “comments” (the echo messages at the top of each conditional block) I would expect an AppVM to have a /dev/xvdc2, and that this is related to the mechanism that allows root to be writable without writing back to the template VM or copying every file on boot. Basically, I would expect /dev/xvdc2 to contain the changed files, which are used instead of the original filesystem contents when they exist. I would also expect template VMs to have a /dev/xvdc3, with the symlink set up only so that other components don’t have to care whether they are in a template VM or an AppVM - they can just use /dev/mapper/dmroot either way.
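
Concretely, this is the layout I expected to find (purely my guess at this point, not output captured from a real VM):

# AppVM (what I expected):
#   /dev/xvdc1          swap
#   /dev/xvdc2          COW area backing a dm snapshot
#   /dev/mapper/dmroot  device-mapper snapshot of /dev/$ROOT_DEV over /dev/xvdc2
# TemplateVM (what I expected):
#   /dev/xvdc1          swap
#   /dev/xvdc3          some other scratch space
#   /dev/mapper/dmroot  symlink straight to the writable /dev/$ROOT_DEV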

The above are all guesses based on my understanding of how QubesOS operates and what would make sense based on the contents of the script, but I have not used dmsetup in the past so I might be missing some things that are obvious to more experienced people.

When I look at my AppVMs though (and I am only looking at ones based on Fedora templates so that things look as “normal” as possible), they have a /dev/xvdc3 and /dev/mapper/dmroot is symlinked to it, which happens in the code block noted as being for template VMs.
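
For reference, this is roughly how I checked inside the AppVM (standard tools only; xvda3 is the root device name on my install, so adjust if yours differs):

readlink -f /dev/mapper/dmroot     # where dmroot actually points
sudo dmsetup table                 # any device-mapper targets that were created
lsblk -o NAME,SIZE,RO /dev/xvdc    # which partitions exist on the volatile device
cat /sys/class/block/xvda3/ro      # the read-only flag the init script branches on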

What am I missing?

OK, I took another look at this and I think I understand what’s going on.

The documentation does describe /dev/xvdc as being used for storing writes temporarily (and I was probably subconsciously remembering this page when I “guessed” this purpose, although the comments in the script itself were also highly suggestive). However, there is also this note in qubes-core-admin/doc/qubes-storage.rst:22 (this is the repository that implements qube lifecycle management):

volatile - this is used for any non-persistent data. This includes swap,
copy-on-write layer for a future read-only root volume, etc.

It says that the read-only root is a future thing, which is reinforced by the value of /sys/class/block/xvda3/ro on one of my AppVMs: it is 0, i.e. writable, as expected, since this is exactly the value the init script checks. I’m not sure whether that is still the direction development is heading, but currently the system uses LVM snapshots to handle the non-persistent root behavior of AppVMs. At qubes/vm/qubesvm.py:1805, the create_on_disk method calls self.storage.create() to set up the LVM volumes. In qubes/storage/lvm.py:37 there is a helpful docstring (abbreviated for length here):

Where suffix can be one of:
    "-snap" - snapshot for currently running VM, at VM shutdown will be
    either discarded (if save_on_stop=False), or committed
    (if save_on_stop=True)

On VM startup, new volume is created, depending on volume type,
according to the table below:

snap_on_start, save_on_stop
True ,         False,        - "-snap", snapshot of last committed revision
                               of source volume (from VM's template)

This seems to supply a sensible explanation: snapshotting on start means the AppVM sees whatever changes had been committed in the template as of its last shutdown, rather than the template’s state when the AppVM was created, which matches the behavior seen on a current QubesOS installation.
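
For anyone who wants to confirm this on their own system, the volume flags and the snapshot are visible from dom0. The commands below assume the default LVM thin-pool install and a VM named "personal" based on a Fedora template, so treat the names as placeholders:

# Show the root volume’s snap_on_start / save_on_stop settings for an AppVM
qvm-volume info personal:root

# While the AppVM is running, its "-snap" volume (the snapshot of the template’s
# last committed root revision) shows up alongside the template’s own root volume
sudo lvs | grep -E 'personal|fedora'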