Qubes Backup is Slow

Yup. I use both! (online and offline git)

Thanks for the tip. The problem arises when you need constant access to a pool of documents (such as a bibliography manager) and you never know which ones you’ll need (and you’re constantly adding new ones). But I’ll try to implement this and see if it works for my particular situation :slight_smile:

So true.

Certainly FUD for most threat models. But full-disk encryption only works when the machine is offline and the RAM is cleared.

For a sufficiently high threat model, I don’t think it’s unreasonable to have these sorts of considerations.

Of course certain data transfers between VMs are safe, otherwise qvm-copy and qvm-move would have no reason to exist. Generally, the rule is that it’s safe to copy/move data from more trusted to less trusted VMs but not vice versa.
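For instance, from inside the more trusted VM, the trusted-to-less-trusted transfer looks like this (a sketch; the file name is invented, and dom0 will still prompt you to confirm the destination):

```shell
# Run inside the source VM. dom0 prompts for the destination VM;
# following the rule above, pick a *less* trusted VM as the target.
qvm-copy ~/Documents/notes.txt

# Same, but removes the local copy after a successful transfer:
qvm-move ~/Documents/notes.txt
```

The file arrives under `~/QubesIncoming/<source-vm>/` in the destination VM.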

Two ideas:

  1. Separate the bibliography manager from the underlying source files it tracks.
  2. Use a qrexec service to access the source files in a backend VM.
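Sketching the second idea: a minimal qrexec service (all names here are invented for illustration; consult the qrexec docs for the exact policy format on your Qubes version):

```shell
# In the backend VM's template, create the RPC handler
# /etc/qubes-rpc/my.GetDocument with contents like:
#
#   #!/bin/sh
#   read -r name
#   cat "/home/user/library/$name"
#
# (a real service should validate $name to prevent path traversal)
#
# In dom0, allow the frontend VM to call it, e.g. in
# /etc/qubes-rpc/policy/my.GetDocument (Qubes 4.0 policy style):
#
#   work library allow
#
# Then, from the frontend VM ("work"), fetch a file on demand:
echo "paper.pdf" | qrexec-client-vm library my.GetDocument > paper.pdf
```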

Granted, if one of the threats you need to defend against is people breaking into your place while you’re asleep and physically restraining you while they use liquid nitrogen on your DIMMs as they cart your computer away to a lab, then yes, I suppose it’s important to make sure your drive is encrypted and keys are wiped from memory whenever you’re not awake and alert near the machine. At that point, though, I think physical security starts to trump digital security (i.e., rubber-hose cryptanalysis).

On a related note, encrypt-on-suspend might make it into Qubes at some point.


Great ideas! :slightly_smiling_face:

Yep :stuck_out_tongue:

Awesome! I didn’t know about that.

Yeah…that would be great.

Yes.

Agreed.

Did you ever try to upload, let’s say, a 50GB heavily encrypted blob to Google Drive? Google refused to store videos which I produced, and they weren’t even encrypted at the time.

Depends… let’s say there is an activist or investigative journalist who has already done certain work, hence is already in the cross-hairs; then it might not go well for that person. Keyword metadata: it has a higher value than content. Keep that in mind.

Yep, I know about that issue.

I don’t think so. Since Qubes OS is basically nothing other than multiple computers in one, it has to be treated like a set of computers, or traditionally, like a computer network.

A note on the notion of bulk data. Fact-checkers, investigators, and analysts have their own libraries: RSS feeds, mailing lists, articles (downloading articles is necessary because often enough articles simply vanish). Over time one creates a very valuable local archive/library, meaning it’s always available. Just to give you an idea: I have an email collection of over 100k emails, plus articles, slides, videos, etc., in total almost 150GB, and that’s without my library. So, according to your proposal, I would have all that neatly stored away somewhere. But I use it on a daily basis. In order to have the emails stored away, I would need to maintain an offline read-only mail server, but then I couldn’t search across all my data, and so on… It’s read-only but active, and it grows daily by around 0.5GB.

You may want to check out https://github.com/tasket/wyng-backup

This was referenced earlier as ‘sparsebak’. Wyng can perform very fast
incremental backups of large volumes, and it can also remove old backups
from the archive quickly. You can also keep pushing backups to it
indefinitely as the next release candidate will monitor disk space and
prune old data automatically to make room.

Wyng does not yet have integrated encryption (which can be currently
handled in dom0 by using an attached LUKS container) and it doesn’t
handle Qubes VM settings for you. I plan to address both these issues in
the next iteration, starting with a wrapper script that handles VM
settings. Since Wyng only handles data as disk blocks, it is also one of
the more secure options available.
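Until encryption is integrated, the LUKS-container approach mentioned above could look roughly like this in dom0 (the size and paths are made up; double-check against the cryptsetup man pages):

```shell
# Create a 200 GB file-backed LUKS container for the Wyng archive
truncate -s 200G /home/user/wyng-archive.img
sudo cryptsetup luksFormat /home/user/wyng-archive.img
sudo cryptsetup open /home/user/wyng-archive.img wyngvault
sudo mkfs.ext4 /dev/mapper/wyngvault

# Mount it and point Wyng's archive destination at the mount point
sudo mount /dev/mapper/wyngvault /mnt/wyng

# ...run your Wyng backups into /mnt/wyng...

# Lock it back up when done
sudo umount /mnt/wyng
sudo cryptsetup close wyngvault
```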


Yes, many TBs, in fact.

No clue what’s going on there.

Can you be more specific?

Let’s go with your scenario. Suppose there’s an activist or investigative journalist who has published some work that causes some very powerful people to feel threatened. So this person is, as you say, already “in the cross hairs” of some people with the means to do serious harm. Now, suppose that the person is contemplating whether she should store encrypted backups on a cloud storage service, such as Google Drive or AWS. As you point out, metadata is an important consideration. For example, an encrypted Qubes backup can be identified as such by its plaintext header.

Ok, so what’s the additional threat you have in mind here? Is it that her powerful adversaries might be able to ascertain that she uses Qubes and stores Qubes backup in these cloud storage services? We can even suppose that they’re able to access her cloud storage accounts and download copies without her knowledge. Unless her use of Qubes was supposed to remain a secret, it’s unclear what she has lost, since the backups are still encrypted. I’m just trying to understand which specific threats you have in mind and what you mean by your statements.

But you’re talking about properties of some system outside of Qubes that receives and stores Qubes backups. For example, one property you specified was, “Backup allows the user to lock certain backups, which will never be deleted unless the user explicitly says so.” That’s the type of control you find in AWS, for example. It’s not a property you get from a simple external hard drive formatted with FAT32. It’s not something that Qubes can enforce on external devices, such as that FAT32 hard drive, because someone can simply plug that hard drive into a Windows machine and delete whatever they want. Even if the entire drive is LUKS-encrypted, they can simply wipe the drive.

That’s not what I proposed. My proposal is simply that if you have bulk data that changes infrequently, it may be appropriate to store it in VMs that are infrequently backed up. This can cut down on the amount of data that has to be backed up frequently.
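For example, in dom0 you might split the schedules along these lines (VM names and destination paths are invented):

```shell
# Frequently changing VMs: back up daily
qvm-backup --compress /mnt/backups/daily work email vault

# Bulk-data VMs that rarely change: back up, say, monthly
qvm-backup --compress /mnt/backups/monthly library media-archive
```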

More complex setups are also possible. For example, @unman described his setup in which emails are stored in an offline backend VM and accessed by the MUA via qrexec (if I understood correctly). Something like this might be appropriate for you, given the complex setup you have.

You may want to check out https://github.com/tasket/wyng-backup

This was referenced earlier as ‘sparsebak’. Wyng can perform very fast
incremental backups of large volumes, and it can also remove old backups
from the archive quickly. You can also keep pushing backups to it
indefinitely as the next release candidate will monitor disk space and
prune old data automatically to make room.

Wyng does not yet have integrated encryption (which can be currently
handled in dom0 by using an attached LUKS container) and it doesn’t
handle Qubes VM settings for you. I plan to address both these issues in
the next iteration, starting with a wrapper script that handles VM
settings. Since Wyng only handles data as disk blocks, it is also one of
the more secure options available.

That sounds interesting…Thanks for the tip.

Yes, many TBs, in fact.

I believe you… I tested it this morning. Not such a big one, but a
smaller one. So okay… that works; 15GB cloud storage max. Not too bad,
but I still don’t trust Google at all, since it’s known whom they
support and how they think about privacy/free expression… but that’s an
entirely different discussion.

No clue what’s going on there.

I didn’t expect that you do. :wink: I just gave an example from my
experience.

Can you be more specific?

Puh… that would blow up the forum. Key questions you might ask yourself
are:

  • Metadata vs. content?
  • Ubiquitous surveillance (location data via our phones, call logs,
    emails, browsing history, contacts, etc. / surveillance capitalism). In
    one word: Panopticon.

Recommended reading: Dark Mirror by Barton Gellman; Snowden’s
autobiography Permanent Record; Glenn Greenwald’s No Place to Hide; the
Snowden archive (on theintercept.com); Wikileaks; EDRi.org; EFF.org,
and so on.

The thing with all this is that it is abstract and invisible. Metadata
is key: the more metadata someone has, the better and more detailed the
linkability is. That means an adversary can predict whether someone
might do something adverse to a certain policy, for example. This
already happens every day: people vanish, get blackmailed, badmouthed
(reputation murder); there are black sites around the globe, and states
that support all this.

Let’s go with your scenario. Suppose there’s an activist or
investigative journalist who has published some work that causes some
very powerful people to feel threatened. So this person is, as you
say, already “in the cross hairs” of some people with the means to do
serious harm. Now, suppose that the person is contemplating whether
she should store encrypted backups on a cloud storage service, such
as Google Drive or AWS. As you point out, metadata is an important
consideration. For example, an encrypted Qubes backup can be
identified as such by its plaintext header.

The simple fact that you use encryption puts you on a list… or even on
a watch list. The NSA assigns every person a threat rating. Nobody knows
how that works. But they act on it.

Ok, so what’s the additional threat you have in mind here? Is it that
her powerful adversaries might be able to ascertain that she uses
Qubes and stores Qubes backup in these cloud storage services? We can
even suppose that they’re able to access her cloud storage accounts
and download copies without her knowledge. Unless her use of Qubes
was supposed to remain a secret, it’s unclear what she has lost,
since the backups are still encrypted. I’m just trying to understand
which specific threats you have in mind and what you mean by your
statements.

Who is using systems like Qubes OS, encryption, GPG, Signal? Even the
simple fact that someone uses open source is enough. In Europe they are
again discussing (behind closed doors) how to monitor communication.
One proposal, again, is to monitor directly at the source. What does
that mean? It means that on every communication device, software (more
likely firmware) has to be installed that grabs your message before you
send it, sends it to a monitoring server where it gets analyzed
(algorithmically, by a human, or both) and, if considered okay, gets
sent to the intended receiver. If interested:

Shortcut to the leaked document:
https://edri.org/wp-content/uploads/2020/09/SKM_C45820090717470-1.pdf

But you’re talking about properties of some system outside of Qubes
that receives and stores Qubes backups. For example, one property you
specified was, “Backup allows the user to lock certain backups, which
will never be deleted unless the user explicitly says so.” That’s the
type of control you find in AWS, for example. It’s not a property you
get from a simple external hard drive formatted with FAT32. It’s not
something that Qubes can enforce on external devices, such as that
FAT32 hard drive, because someone can simply plug that hard drive
into a Windows machine and delete whatever they want. Even if the
entire drive is LUKS-encrypted, they can simply wipe the drive.

Ah, I think here we have a clash of generations. :wink: Nope, you do
not need the cloud to have that. For example, software like Timeshift,
Back In Time, or even Apple’s Time Machine can do it (argh… not the
locking). By locking I simply mean that a backup can’t be deleted. In
qubes-backup it could be done simply by appending the letter “L” to the
archive name. Then, the next time qubes-backup runs, it would skip all
archives whose names end in “L”.
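The suggested convention is easy to sketch in shell: a pruning step that refuses to touch any archive whose name ends in “L” (archive names below are invented):

```shell
# Toy illustration of "locked" archives: skip anything ending in "L".
archives="qubes-2021-01 qubes-2021-02L qubes-2021-03"

for a in $archives; do
    case "$a" in
        *L) echo "locked, keeping: $a" ;;
        *)  echo "prunable:        $a" ;;
    esac
done
# prints one "locked, keeping" line and two "prunable" lines
```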

That’s not what I proposed. My proposal is simply that if you have
bulk data that changes infrequently, it may be appropriate to store
it in VMs that are infrequently backed up. This can cut down on the
amount of data that has to be backed up frequently.

More complex setups are also possible. For example, @unman described
his setup in which emails are stored in an offline backend VM and
accessed by the MUA via qrexec (if I understood correctly). Something
like this might be appropriate for you, given the complex setup you
have.

That’s it exactly. I do not have the time to write scripts. Currently I
do write them, because every N years I replace hardware and revisit my
strategy and system/software implementation.
But thanks for the tip. I haven’t checked out qrexec; it’s still on the
list, unchecked. :frowning:

It sounds like you may have lost the thread of the discussion. To recap:

  1. I claimed that it’s fine to store encrypted backups in cloud storage as well as locally.
  2. You said: “Depends… lets say there is activist or investigative journalist who did certain work already, hence is already on the cross-hairs, then it might not go well for that person […]”
  3. I asked what you meant by that: What is the additional threat to this person from storing an encrypted backup in cloud storage? Is it just that now the adversary knows she’s a Qubes user, or is there something more?
  4. You responded that simply using encryption and privacy tools is enough to put you on a list.

But remember that we’re assuming that this person is already “on a list” due to her published work, which threatens powerful people. That public work is already enough by itself to put her in their cross-hairs, regardless of her use of technology.

Now you’re gesturing vaguely toward the dangers of metadata in general. I’m well-aware of those dangers and certainly don’t deny them, but that doesn’t address the specific question I’m asking. I repeat: What exactly are the additional risks our hypothetical person incurs by storing encrypted backups in the cloud?

I think it’s also worth noting that some of your claims about metadata are too strong. For example, earlier you said:

Again, I’m well aware of the privacy-destroying power that can be wrought with metadata, but the unqualified claim that metadata has a higher value than content is dubious at best (especially when we don’t stop to ask “Valuable for whom?”). If I had to choose, I would rather keep the contents of my phone calls, emails, postal mail, medical records, and tax returns private than the metadata about that content. The supposition that metadata is always more valuable than content leads to absurd results: Encryption would have no reason to exist, since the encryption metadata would be more valuable than what it protects. Envelopes would be sealed inside of letters rather than the other way around.

And I think you assume too much.

I never claimed you do. AWS was simply an example of something that provides such control.

You just proved my point. Those are independent of Qubes.

That only prevents Qubes from deleting those backups. It doesn’t prevent those backups from being deleted. (Refer back to my example of the simple hard drive formatted with FAT32.)

I am kind of weak on data integrity checks, sha256sums, etc. Can someone point me to some good guides on data verification and integrity?
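As a starting point, the core workflow is just two commands: record checksums once, verify them later (the paths below are invented for the demo):

```shell
# Record a SHA-256 checksum for each backup file
mkdir -p /tmp/backup-demo
echo "pretend this is a backup" > /tmp/backup-demo/backup-2021.bin
cd /tmp/backup-demo
sha256sum backup-2021.bin > SHA256SUMS

# Later -- e.g. after copying the files to another disk -- verify them.
# Any corruption or truncation changes the hash and fails the check.
sha256sum -c SHA256SUMS
# prints: backup-2021.bin: OK
```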

When restoring backed-up VMs, you occasionally get template name changes, such as fedora-30 becoming fedora-301. My question is: how would you delete the stock template installed during a fresh install? For example, fedora-30 would be deleted, since I have fedora-301 (which is the updated version). FYI, I verified all qubes that were based on fedora-30 and changed them to fedora-301 in the Qubes Template Manager. Thanks.

In dom0:

sudo dnf remove qubes-template-fedora-30

Hey @tasket. How does one go about verifying your GPG key?

I’ve found none associated with your email.

Hi, I’m kind of late to the party, and must admit I only skimmed the thread.

Did you consider doing offline backups? Like shutting down the Qubes machine and just taking an image of the disk(s)?
With a deduplicating backup tool like restic (https://restic.net), this would not take up too much space.
If you have a 1TB disk which is fully encrypted, you either have a first backup that takes 1TB (if encrypted data has actually been written to all sectors) or less (if only the used space looks like random data and the not-yet-written sectors contain zeros; I actually don’t know whether Qubes writes all sectors during installation to hide information about data size…).
Every subsequent snapshot will only transfer sections that have changed, so the repository will only grow a little per snapshot. Of course it would have to read the whole disk again, but a simple sequential read (without decryption) of a disk is still faster than some of the time constraints that were mentioned, especially if the system disk is an SSD.
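A rough sketch of that image-level approach with restic (the device name, repository path, and password file are placeholders; check the restic docs for details):

```shell
# One-time: create the deduplicating repository
export RESTIC_REPOSITORY=/mnt/external/restic-repo
export RESTIC_PASSWORD_FILE=$HOME/.restic-pass
restic init

# Stream the whole (encrypted) disk into restic; unchanged regions
# deduplicate against earlier snapshots, so repeat backups stay small.
dd if=/dev/sdX bs=4M status=progress \
  | restic backup --stdin --stdin-filename qubes-disk.img

# Restoring a snapshot later:
restic restore latest --target /mnt/restore
```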

Due to the system being offline, there is also no attack vector to intercept information. The backup repository is trivially encrypted (because it’s an image of an encrypted drive).

I did not actually try it out yet for my qubes machine, but I’m using restic for backup of other machines here, and I’m planning on using it for qubes.

It would of course be nicer if I could use restic for single-vm backups, so a complete read of the disk can be skipped.

Interesting approach. Let me know how it goes!

I just did a test-run. I attached the 1TB SSD on which qubes is installed to a S-ATA to USB3 cable and to another machine. Then used restic to store a complete disk image. As I wrote the image to an NVMe SSD, this was really fast, it took about 42 minutes. The resulting restic repository had a size of 121GB, which means that all the not-yet-written-to areas of the qubes disk contained zeros and were deduplicated by restic.

A subsequent second snapshot (without booting up Qubes in between, thus identical data) also took 42 minutes, and the repository stayed at 121GB.

The first snapshot would of course take a lot longer if the backup target was a HDD or some network drive.
But thanks to deduplication, every subsequent backup (as long as the amount of changed data is not huge) should roughly take those 42 minutes, as most of the time will be spent reading areas that have not changed and only writing tiny amounts of metadata. The limiting factor then is the sequential read speed of your disk, which hopefully is an SSD when running Qubes :slight_smile:

This way you can incrementally back up your whole system in a fixed amount of time. You could also play around with partitioning schemes or multiple disks to have a “split” backup: one smaller disk/partition that contains Qubes with its VMs, which is then faster to back up via image, and another disk which keeps your data and uses file-based backup, so as not to always read the whole disk.
Judging from my resulting repository, I could easily get away with a 250GB SSD for the system, which would then take about 10-15 minutes to image.

EDIT: Ha, now I know what to do with that spare 250GB SSD I have lying around :smiley: brb, installing qubes…

EDIT 2: restoring a snapshot is slower, reading from the repository is at about 120MB/s. But you hopefully don’t have to do that too frequently.


Wow. This is fast! My full 400GB backup takes around 6h-8h, I think!

Optimizing for the most common operations is great, even if that’s at the cost of “slowing down” non-common operations. So this is ideal, I think.

The only “deal-breaker” for some, may be the need to take it out. But I really like your solution overall. I’ll have to play around with it.

There is no need to take it out. You could instead boot a USB live system (or do a network boot) to back up the installed SSD to an external device. I did it this way for testing because it was easier for me (i.e., I was too lazy to find a USB stick to boot from :smiley:).

My ThinkPads have a BIOS option to automatically boot from the network if they were powered on via Wake-on-LAN. So I could imagine fully unattended nightly backups where a backup server wakes the machine, lets it boot an auto-backup tool via PXE, and shuts down the machine after a successful backup.
Whether such a backup server makes sense under the threat models that Qubes addresses is of course left to your own judgment.

That is less than 20MB/s! Damn… you could write a full disk image via USB2 or a good internet connection in that time.


Around 200GB of frequently used machines and another 300GB of infrequently used / changing machines.

The Qubes backup system is extremely slow for this. I was using Wyng to take LVM snapshot backups but have now moved to Btrfs without LVM for other reasons.

I think a combination of physical backups with partclone and logical backups with the Qubes Backup-Restore tool is a potential way forward, but I wonder what others are doing.