Qubes Air will not support online/offline migration: what it really means, and plans

Background

During the Qubes OS Summit 2024, it was revealed that parts of Qubes Air might come to R4.3, but sadly, migration will not be part of it.

For those who are not familiar with the concept of qube (virtual machine) migration: it is a technology that allows virtual machines to be transferred over a network (Ethernet, Wi-Fi, …) between two hypervisors (physical machines). This would eliminate the need to back up to and restore from spinning rust (old HDDs) or NAND flash (USB drives and SSDs) when transferring qubes from an old computer to a new one, which would be really convenient.

There are two kinds of migration:

Offline migration

This kind of migration requires you to turn off the VM. Only the virtual volumes and some metadata are transferred between the two machines.
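As a rough mental model, offline migration boils down to: halt the VM, copy its volumes and metadata to the other machine, then delete the original. A toy sketch (all class and function names here are invented for illustration; this is not the Qubes OS or Xen API):

```python
from dataclasses import dataclass, field

# Toy model of offline migration. Names are hypothetical, NOT real Qubes APIs.

@dataclass
class VM:
    name: str
    state: str        # "halted" or "running"
    volumes: dict     # volume name -> raw bytes
    metadata: dict    # CPU count, memory, label, class, ...

@dataclass
class Hypervisor:
    vms: dict = field(default_factory=dict)

    def receive(self, vm: "VM") -> None:
        self.vms[vm.name] = vm

def offline_migrate(vm: VM, source: Hypervisor, dest: Hypervisor) -> None:
    """Copy a halted VM's volumes and metadata to another hypervisor,
    then remove it from the source."""
    if vm.state != "halted":
        raise RuntimeError("offline migration requires a halted VM")
    # Only the volumes and the metadata cross the network.
    dest.receive(VM(vm.name, "halted", dict(vm.volumes), dict(vm.metadata)))
    # Delete from the source only after the copy has completed.
    del source.vms[vm.name]
```

Note that nothing else needs to move: a halted VM *is* its volumes plus its metadata.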

Online migration

This technology does not require the VM to be turned off. Both hypervisors should present a similar CPU to the guest. Storage volumes, and live changes to them, can either be transferred gradually over the network, or reside on some sort of clustered file system or distributed storage (SAN, VMFS, Fibre Channel, CephFS, GlusterFS, …). The running VM's memory is transferred gradually over the network as well. This is a high-availability scenario specific to the server/cloud world; I am not aware of any desktop OS capable of doing it.
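The usual building block behind "gradually transferring running VM memory" is the pre-copy algorithm: memory is copied while the VM keeps running, pages the guest dirties are re-sent in rounds, and the VM pauses only for the final small delta. A generic toy sketch (not Xen or Qubes code; names invented):

```python
def precopy_migrate(memory: dict, read_dirty_pages, max_rounds: int = 5) -> dict:
    """Toy pre-copy live migration: 'memory' maps page number -> bytes, and
    read_dirty_pages() returns the pages the still-running VM has modified
    since the previous round. Returns the destination's copy of memory."""
    dest = dict(memory)                 # round 0: copy every page once
    for _ in range(max_rounds):
        dirty = read_dirty_pages()      # pages written while we were copying
        if not dirty:
            break                       # converged: nothing left to re-send
        dest.update(dirty)              # re-send only what changed
    # A real hypervisor would now pause the VM, send the final dirty pages
    # plus CPU/device state, and resume execution on the destination.
    return dest
```

If the guest dirties pages faster than the network can re-send them, the loop never converges; real implementations fall back to pausing the VM (or throttling it) after `max_rounds`.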

Plans

While developing an online migration mechanism for Qubes OS is hypothetically possible, it is a very expensive technology to build. It is what the big players (VMware, Google, Amazon, …) spend heavily on for their cloud servers, so we can skip it.

Offline migration is something which might be achievable in the long term. It demands close monitoring of Qubes Air development and its capabilities, to see whether migration could be implemented on top of it.

7 Likes

Why only some metadata? And why only “Only”? Is there anything else besides volumes and metadata?

1 Like

Metadata is indeed a broad concept. It includes generic VM configuration such as CPU core count, allocated memory, networking configuration, etc., and, in the case of qubes: label, class, features, tags, …

In short: any data related to a VM that is not stored within the VM's volumes.

1 Like

So, nothing except the volumes (and metadata that is literally written on them) is transferred?

1 Like

In the case of (hypothetical) Qubes OS offline qube migration, it could be exactly like that.

In the case of enterprise VM/cloud offline migration (e.g. VMware cold migration, Proxmox migration), even the volumes might not be transferred, as they may reside on SAN storage shared between the hypervisors. Only some metadata is transferred.
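To illustrate the shared-storage case: since both hypervisors already see the same volumes, "migrating" a halted VM degenerates into re-pointing its definition. A toy sketch with invented names:

```python
def shared_storage_migrate(vm_name: str, registry: dict, source: str, dest: str) -> None:
    """Toy sketch: when volumes live on shared storage (e.g. a SAN), both
    hypervisors already see the same disks, so migrating a halted VM is just
    re-pointing its definition (metadata) -- no volume data moves at all."""
    metadata = registry[source].pop(vm_name)   # detach definition from source
    registry[dest][vm_name] = metadata         # attach it to the destination
```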

2 Likes

What @alimirjamali means by “volumes and metadata” is essentially (but not limited to):

  • The partitions inside the qube (volumes)
    • /dev/xvda, /dev/xvdb, /dev/xvdc, etc.
  • The metadata
    • The output of qvm-prefs <qube>, more or less
      • These values are contained in the XML files corresponding to each qube
      • They specify things about the environment that Xen needs to create to boot the qube successfully

This, with a few caveats, is a qube at its bare bones.
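For a concrete (but invented) picture, the metadata half might look roughly like this. The field names below are illustrative only, not the actual Qubes schema stored in those per-qube XML files:

```python
# Illustrative only -- NOT the actual Qubes OS metadata schema.
qube_metadata = {
    # Generic VM configuration (what Xen needs to boot the qube)
    "vcpus": 2,
    "memory": 4096,           # MiB
    "netvm": "sys-firewall",
    # Qubes-specific attributes
    "label": "blue",
    "klass": "AppVM",
    "template": "debian-12",  # hypothetical template name
    "features": {"gui": "1"},
    "tags": ["work"],
}
```

Pair that dictionary with the contents of /dev/xvda, /dev/xvdb, etc., and you have (more or less) everything needed to recreate the qube elsewhere.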

This is mostly all you need to be able to:

  • Run qubes remotely on someone else’s hardware
    • Homelab server
    • Remote computer
    • Cloud Provider’s hardware
  • Back up qubes, and move them around between machines
    • Either when they’re shut down (“at rest”), or potentially while they’re still running

Context for Qubes Air, Data Centers, and “Live qube Transfer”

@alimirjamali is also describing the way gigantic data centers are structured in terms of hardware and networking.

They consist of tens of thousands of individual computers/servers (and those computers are usually pretending to be tens of thousands more), plus gigantic storage arrays (imagine a warehouse full of nothing but hard drives holding YouTube or Netflix media).

Imagine how many people per second are asking that data center for files. Imagine how many of those are asking for the same file.

In your regular home or office network, you generally have the following:

  • One (or two) ways to get out of the network (WAN endpoints)
  • One (or two) big devices that everything is plugged into, that sorts, coordinates and forwards everyone’s data packets (Switch)
  • Usually a big difference in speed when accessing devices inside the network compared to devices outside it (because most ISPs believe it's morally right to extract value by artificially throttling/limiting the capabilities of physical hardware :face_with_open_eyes_and_hand_over_mouth:)
  • Devices are generally connected to each other using the “Hub and Spoke” method.

Data center networking is a whole different ballgame:

  • Hundreds, if not thousands, of ways to get in and out of the network (endpoints)
  • Almost every server needs to be able to talk to the internet just as fast as other servers inside the data center
  • Parts of the network need to be kept separate from each other, even though they likely use the same cables
  • Usually the hard drives will generate enough traffic that they need to be on their own dedicated network
  • The backups of the backups have backups, in case the backups of the backups fail
    • Take a shot every time someone says the word “redundancy” :stuck_out_tongue:

This is why data center networks are usually built using the "Spine and Leaf" topology instead: every "leaf" (access) switch connects to every "spine" (core) switch, so any two servers are at most two switch hops apart.

There are some exceptions (some people do have quite a lot of money to spend on their home networks :stuck_out_tongue:), but in general, Data Centers have a bottomless pit of money to be able to buy “the best of the best” in computer hardware and infrastructure.

They also change/rotate/upgrade that hardware faster than most of us change/upgrade our wardrobe/clothes!

(No joke, there are Data Centers that employ people whose ONLY job is to spend 8 hours a day going around to all the servers with a trolley of brand new SSDs, and swapping out the old ones that have died with fresh new ones, because the old ones have been written to so much that they’re now read-only. And I can guarantee you that it would be 3 x 8-hour shifts for round-the-clock drive-swapping)

Most of them have also promised their customers that they will “always be able to access their stuff, no matter what, 24/7/365.25”. They have also usually never told their customers what actually happens to their stuff behind the scenes (and to be fair, most customers couldn’t care less, as long as they can access their stuff when they want to, and nothing bad happens to their stuff).

This is important in the case of VMs. If you were asking a Data Center to run a VM for you containing your company network LDAP database (so your employees can log into their workstations/portals), you’d definitely want that to be always running.

You probably wouldn’t be too happy if the Data Center shut that down for an hour during the day to transfer it to a different server (leaving your employees unable to log in, just sitting there getting paid to do nothing…).

This is why the Data Centers have spent a lot of time, effort and money developing ways to be able to move customers’ VMs around while they’re still running, internet-connected, and performing their tasks.

We know how some of those methods work (FOSS), and some we don’t (proprietary).


How does this relate to Qubes Air?

There will definitely be people out there who will happily put (some of) their qubes in a Data Center, where they will want to be able to access them at lightning-fast speed, and not have to worry about where they actually are, what hardware they’re running on, etc.

These people would probably also be ok with the Data Centers running their qubes for them (outsourced execution), even if it means allowing the Data Centers to see/understand everything that goes on inside those qubes.

There will also be people out there who would be OK with Data Centers storing their qubes “at rest” for them, but running them on hardware that they have complete control over (local execution). These people will need one of the following:

  • A constant, uninterrupted streaming connection to their storage
    • Every read/write requires a data packet leaving/entering their Qubes OS machine
  • To copy over their entire qube’s contents in one go, execute it locally, and then copy it back upon qube shutdown
    • Think git branches/merges, but with a qube (could be cool) :face_with_diagonal_mouth:
  • A hybrid combination of the two :thinking:
    • With a healthy dose of encryption so you can hand it to someone and say “hold my beer” :sunglasses:
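The second option ("copy over, run locally, copy back") could be sketched like this; every name here is hypothetical, and a real implementation would encrypt the volumes before they ever leave the machine:

```python
class RemoteStore:
    """Hypothetical remote 'at rest' storage for a qube's volumes."""
    def __init__(self):
        self._data = {}

    def fetch(self, name: str) -> dict:
        return dict(self._data.get(name, {}))

    def store(self, name: str, volumes: dict) -> None:
        self._data[name] = dict(volumes)

def run_qube_locally(store: RemoteStore, qube_name: str, run_until_shutdown) -> dict:
    """'Check out, run, check in': fetch the qube's volumes, run the qube on
    local hardware (run_until_shutdown takes the volumes and returns the
    modified ones), then push the changes back when the qube shuts down."""
    volumes = store.fetch(qube_name)          # copy everything down
    volumes = run_until_shutdown(volumes)     # local execution
    store.store(qube_name, volumes)           # push the changes back
    return volumes
```

The git analogy above fits here: the fetch is a checkout, the shutdown push is a commit, and conflict handling (two machines running the same qube) is the hard, unsolved part.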

This means that online/offline migration isn’t something that can feasibly be written in code any time soon, so they’ve chosen to keep it on their wishlist and focus on what they think they can achieve with their current resources.


I can also see “migration” (as opposed to straight copying of the 0s and 1s) of a qube having an uncanny ability to produce unforeseen complications, forcing you to “go back to square one” multiple times; honestly, in fairness, it’s more work than it’s worth at this point in time.

There will be a way (in fact, there will likely be many ways…), but at the moment, it’s too under-resourced to investigate…

…and qubes-backup currently meets the needs functionally for those that wish to move qubes from one machine to another at this point in time.


But still, being able to remotely control another Qubes OS machine using this protocol is exciting. Being able to cluster and orchestrate Qubes OS machines together is the first step to making the other things a reality.

Very exciting times, indeed.

3 Likes

Cool, thanks for the clarification! In the original description, it kind of looked like not all the parts required to run a qube were getting transferred.

I mean, they don’t need to integrate this into the cloud or any kind of protocol for it to be useful. Storage management is one of Qubes OS’s glaring problems, even on a local install. Constantly cloning stuff is very bulky and slow, and it limits our ability to compartmentalize things. I see snapshot version management as a solution.

1 Like

Work is being done on all of this, for sure.

The vanilla Qubes OS install is what we know is tried and tested against bugs and known vulnerabilities, fairly stable, albeit a little clunky and janky, sufficiently usable, reasonably idiot-proof (important for beginners), and we are willing to put our name behind it.

But I promise you, there are a lot of things out in the wild that have been added onto Qubes OS by the community.

Honestly, I’m not quite sure how cloning qubes impacts your ability to compartmentalise things. That’s a bit of a stretch, to be honest. But I suppose it would increase the storage requirements, processing requirements, memory requirements, etc…

Cloning is done the way it is because there are quite a few users out there who need that level of certainty that there’s no way the process can be exploited by anyone (journalists, security researchers, “Dave from Finance”, etc.). Hence why it’s still the standard.

But I do agree with you that there are benefits to be had from “delta templates” or “snapshots” if you will, if done right, with no compromises as far as security/integrity goes.

It would allow you to stack multiple templates on top of each other before you overlay the qube. You could even have a single application in a template and overlay that application (and all its /lib, /etc, /usr, and /opt files) across an OS template, allowing you to (potentially) hot-swap the application onto a different OS with all its configuration intact.
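Conceptually this is a union/overlay filesystem: each layer stores only what it changes, and lookups fall through to lower layers (the qube's private layer on top, then the app template, then the OS template). A toy illustration, not how Qubes actually implements templates:

```python
def overlay_lookup(path: str, layers: list) -> str:
    """Resolve a file through a stack of layers (topmost wins), similar in
    spirit to overlayfs. Each layer maps path -> contents; the qube's own
    private layer sits on top of the app and OS template layers."""
    for layer in layers:                 # layers ordered top to bottom
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)
```

With this model, "hot-swapping the application onto a different OS" is just replacing the bottom layer while keeping the app and private layers intact.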

Yes, there’s a lot in this that can go wrong (trust me, the more I wrote, the more I realised that), but that’s the point I’m trying to make. The devs have definitely thought about cool ideas like this, realised that a cool idea takes a lot of time to become a solid one, and have opted to keep it simple and build incrementally from there.

For every solid idea, there are at least 30 other cool ideas on a Qubes dev’s machine, in various states of functionality :stuck_out_tongue_winking_eye:


Side-note:

If you (or anyone reading this) have any code, by all means, commit it. Even if it’s broken, problematic, “not your best work”, or whatever. At least it will get the ball rolling for the community :slight_smile:

2 Likes

Indirectly. Cloning is bulkier, and storage quickly becomes the limiting factor, especially if you’re trying to keep your system mobile :wink:

(or if you have a lot of projects that must be kept in standalones, although I must admit that this may also be solved with nested virtualization or containers)

New terminology acquired

2 Likes