Qubes Backup is Slow

I strongly recommend the Qubes backup system, which was designed for
precisely this type of use case. It performs all of the required
authentication and encryption steps for you in a way that makes it
hard for you to mess up. (For example, if you’re not sure whether you
should encrypt-then-MAC or MAC-then-encrypt, or if you don’t see how
it makes a difference, then you should not be trying to take a DIY
approach to this stuff with sensitive data.) The Qubes backup system
is specifically designed for the scenario in which you need to back
up data to some untrusted location (e.g., cloud storage, non-
physically-secured external media), then restore it back into Qubes
later, in a way that preserves the security properties of both the
data and the Qubes installation.

But truth be told, it’s slow. Very slow. Currently I have only around
190GB to back up (all VMs except sys-usb; all other VMs are shut down)
to an external USB3 4TB disk via sys-usb, and it takes over 5 hours. I’m
not sure whether that will work when I get my new machine with 2TB
internally.

Depending on the type of data you’re backing up, I bet you can speed it up considerably by turning off compression (--no-compress).

1 Like

Depending on the type of data you’re backing up, I bet you can speed
it up considerably by turning off compression (--no-compress).

That’s what I did, but still. To speed up the backup process, the backup
software needs to implement something like rsync, but that would mean
creating a virtual representation of the whole Qubes installation in
dom0, which would then be presented to qubes.backup. Quite a challenge,
but we have to come up with something…

How often do you guys do backups, and how much data is it in GB?

:question:

1 Like

Frequency: 1 month in theory (not so much in practice)
Time: 6-8 hours (overnight)
Size: 300 GBs

That means you don’t have a current backup. I guess you have a lot of data in the cloud, right?

That’s not for me. I consider the cloud as currently provided by the big players (Google, Apple, Amazon, etc.) untrusted and extremely harmful. Even if one stores only opaque blobs there, I’m not sure whether the above-mentioned providers would allow it. And anyway, backups have to be done daily (first backup: the full system, everything; subsequent backups: only the changes since the last backup), stored offline, and accessible via a different system in case of disaster. The current Qubes backup does all of that, except that it is blind with regard to changes. It backs up all VMs, or chosen ones, so the user has to keep notes on where and when s/he changed something. That’s a real problem.

Currently I’m thinking about a backup system that fulfills the following requirements:

  • No data gets exposed in dom0.
  • A Backup-VM collects all data by creating a virtual filesystem (VFS).
  • Backup creates hash keys for all files and stores those keys in the Backup-VM for subsequent backups.
  • Creates a data stream (preserving the VFS structure), maybe via tar/cpio; pipes it into scrypt; scrypt writes to the external backup medium.
  • Subsequent backups create hash keys for all files and compare them with the ones from the previous backup. Then only files that have changed are copied via the previously described process (see the sketch after this list).
  • In case of disaster, the user should be able to extract the archive, which would recreate the VFS on an external storage device.
  • Backup checks beforehand whether there is enough space available on the backup medium. If there is not, it deletes older backups, starting with the oldest, until there is enough space.
  • Backup allows the user to lock certain backups, which will never be deleted unless the user explicitly says so.
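To make the incremental part concrete, here is a minimal sketch of what the Backup-VM cycle could look like, assuming bash, GNU coreutils, tar, and an scrypt build that can read from stdin; all paths, filenames, and the manifest format are made up for illustration:

```bash
#!/bin/bash
# Hypothetical sketch of the hash-compare-stream cycle described above.
SRC=/home/user/data          # virtual filesystem assembled in the Backup-VM
OLD=manifest.prev            # hash keys from the previous backup run
NEW=manifest.cur             # hash keys from this run
DEST=/mnt/backup             # external backup medium, already mounted

# 1. Compute a hash key for every file.
( cd "$SRC" && find . -type f -exec sha256sum {} + ) > "$NEW"

# 2. Keep only entries whose hash+path pair wasn't in the previous manifest
#    (i.e., new or changed files). Paths with whitespace would need more care.
comm -23 <(sort "$NEW") <(sort "$OLD" 2>/dev/null) \
    | awk '{print $2}' > changed.list

# 3. Stream only the changed files through tar and encrypt with scrypt.
( cd "$SRC" && tar -cf - -T "$OLDPWD/changed.list" ) \
    | scrypt enc - "$DEST/backup-$(date +%F).tar.scrypt"

mv "$NEW" "$OLD"             # this run becomes the baseline for the next one
```

The deletion, locking, and restore logic would sit on top of a loop like this.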

Quite a wish list. On a traditional system, no problem: simply use the Unix toolbox. But in Qubes OS it’s difficult, and yet users need to be able to do daily backups. Storing data in the cloud can’t be seriously considered, except with providers who actually offer encrypted cloud storage (where the keys are not with the provider). Alternatively, users (if technically fit enough, willing, and with the time) could create and maintain their own cloud, or maybe a kind of dropbox/onion-share.

Thoughts?

1 Like

Most of this is just me fooling around with VM Templates and other projects. Nothing really too serious if it gets lost. I don’t usually keep too much state on my systems (projects are archived offline) and a lot of the code is on git repos anyways (if you consider that “the cloud”, then yeah).

I’ve had Qubes break on me on multiple occasions and not too much was lost. Not enough to make me do 300GB backups daily, anyway.

But I do think the slow (and non-incremental) backups are a serious issue that should be looked into (even more than new features). If it were easier, I’d do it. But in the current state, it’s too much of a burden.

I guess it’s also an interesting challenge to try to get your data back after something breaks (at least the motivation is strong).

And I don’t think I’m alone on this. I’d go so far as to say the vast majority of Qubes users don’t do daily backups (not even weekly). But only user research will be able to answer these questions.

Yep…definitely.

Qubes backup already delivers that. Slow and cumbersome, but it works. I’ve had to use it, and I work/test with real data. Data lost = zero.

Definitely not alone… I can tell for sure… I did data recovery/restore for a living in virtual environments (not Xen but VMware), and usually even top sysadmins had to admit that they had neither current nor, more importantly, tested backups.

So, to move forward, we should define the requirements and limits for daily backups. I won’t drop “daily,” because consider our target group: investigative journalists, activists, whistleblowers, and non-technical users. Imagine a journalist receives leaked documents (Snowden sent Barton Gellman 50,000 documents in a triply encrypted archive he called Pandora, as documented in Dark Mirror by Barton Gellman). Now imagine such a journalist uses Qubes without daily backups. You might interject and point out that such data should be stored on an offline medium. Yes, that’s true, but how does a journalist work with it? S/he has to unpack it, analyze it, and so on. Furthermore, take into consideration that journalists work on multiple stories at a time (I was told up to 12-15). They can’t juggle pen drives all the time; that wouldn’t work… They might use Qubes for exactly the reason Qubes is built for: compartmentalization.

However, I would start with the following requirements/limits:

  • Target users: laptop/desktop users.
  • Max data volume: a) laptop <= 2TB (that might already be too much; do I really need to carry 2TB of data with me?); b) desktop: probably more than 2TB, or do desktop users use RAID (external/internal)? In that case they have probably configured at least RAID1 and could live without daily backups; maybe weekly?
  • Adhere to the Qubes OS security policy: no data gets exposed to dom0, except opaque data blobs.
  • Should be incremental, i.e., work on the file level.
  • Shouldn’t take longer than 10 hours, except for the first run.
  • Backup medium should be an external USB3 disk with a LUKS partition (see the sketch after this list).
  • In case it takes longer (probably desktop users with bigger storage systems), the backup system should have a configurable prioritization system, where the user can prioritize which files or folders shall be backed up daily, weekly, etc.
  • Should be accessible on another Unix-like system, so that in case of disaster the user can get her/his data back. Tails worked for me, but maybe the backup system should have an option to create a Qubes OS disaster-recovery image, which contains all necessary tools and can be flashed onto a USB stick. Such an image should be created together with the backup, but only if there were Qubes OS updates that would impact the recovery procedure.
  • I could also imagine something like a secure Dropbox or OnionShare as the backup medium, set up as a private cloud. I’m not really sure about that.
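Preparing such a LUKS medium is standard cryptsetup usage; a minimal sketch, assuming the disk appears as /dev/sdb1 in whichever qube handles it (the device name and mount point are hypothetical):

```bash
# WARNING: luksFormat irreversibly wipes the partition.
sudo cryptsetup luksFormat /dev/sdb1           # create the LUKS container
sudo cryptsetup open /dev/sdb1 backup          # unlock as /dev/mapper/backup
sudo mkfs.ext4 /dev/mapper/backup              # one-time filesystem creation

sudo mkdir -p /mnt/backup
sudo mount /dev/mapper/backup /mnt/backup      # ready for the backup run
# ... write backups to /mnt/backup ...
sudo umount /mnt/backup
sudo cryptsetup close backup                   # lock the container again
```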

Something missing? Too much?

1 Like

Thanks a lot for the insightful post, @qufo!

I’d say anything that takes longer than 30-45 mins will be considered an overnight job. This also increases the risk of compromise since the user will have to leave the computer on during the night.

So this requirement should be reduced significantly or split into two modes of operation:

  • fast backup - user-defined prioritization policy - like the one you suggest (which will take 30-45 mins)
  • full backup - a typical full backup (overnight job)

You’re probably aware of the Backup Emergency Restore (without Qubes). What you mean by this is something usable by the end-user, right?

I wouldn’t add this as a requirement, but rather a (possible) future feature.

I agree with all other requirements.

Daily, <15 GB.

Bulk data that changes infrequently is organized into separate VMs that are backed up less frequently than the small, frequently-changing, important ones.

Same, but there’s no problem with storing encrypted data there.

Of course they do.

Agreed, but this is compatible with also having encrypted backups in cloud storage. Cloud storage is by far the easiest way to maintain up-to-date offsite backups. (Again, just make sure you’re storing only encrypted data. All Qubes backups are encrypted.) In addition to offsite backups, you should also have local offline backups, of course.

That’s exactly what #858 is about.

A lot of this stuff is out of scope for Qubes. You’re talking about properties of some external backup service.

A git repo is fundamentally just a set of files that git manages. You can create a git repo on an offline machine and never upload it to a remote. Your remote can be another local machine not connected to the internet. You can also use a platform like GitHub or GitLab. Git is not inherently tied to the internet.
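For example (the paths here are hypothetical):

```bash
# Create a bare repository on an offline disk or machine.
git init --bare /mnt/offline-disk/project.git

# In the working repository, add it as a remote and push to it.
git remote add offline /mnt/offline-disk/project.git
git push offline main
```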

I’m guessing that you’re not generating hundreds of GBs of new data every day. Most of that data probably changes infrequently, in which case you can get a lot of mileage out of simply organizing your VMs better. Have a few large VMs for storing infrequently-changing bulk data. Back those up less frequently. Have smaller VMs for important data that changes frequently. Back those up daily.
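As a sketch of that split, with hypothetical qube names and a destination directory assumed to be mounted in dom0 (qvm-backup takes a destination followed by an optional list of qubes):

```bash
# Daily: small qubes holding important, frequently-changing data.
qvm-backup /mnt/backup/daily work email

# Weekly/monthly: large qubes holding infrequently-changing bulk data.
qvm-backup /mnt/backup/bulk media archive
```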

You don’t know that you have a reliable backup system until you test this.

Yes, but this is true of computer users in general. If anything, Qubes users are probably more conscientious about it than the average person due to self-selection. It’s an interesting and long-observed psychological phenomenon that you can’t get most people to care about backups until they’ve personally experienced significant, permanent data loss (and, in some cases, not even then).

No special distro or image is needed. We already have detailed, step-by-step documentation on this, which you can test for yourself any time. @deeplow already shared it above, but here it is again: Emergency backup recovery (v4) | Qubes OS

Most of the other properties you mention are already present in the Qubes backup system.

That sounds like superstition/FUD.

So you’re talking about data flow between VMs in Qubes. Is it safe to do so? Is it okay to create an (offline) storage VM, send data there from other VMs, and back it up when needed?

Yup. I use both! (online and offline git)

Thanks for the tip. The problem arises when you need constant access to a pool of documents (such as a bibliography manager), never know which ones you’ll need, and are constantly adding new ones. But I’ll try to implement this and see if it works for my particular situation :slight_smile:

So true.

Certainly FUD for most threat models. But full-disk encryption only protects you when the machine is powered off and the RAM is cleared.

For a high enough threat model, I don’t think it’s unreasonable to have these sorts of considerations.

Of course certain data transfers between VMs are safe, otherwise qvm-copy and qvm-move would have no reason to exist. Generally, the rule is that it’s safe to copy/move data from more trusted to less trusted VMs but not vice versa.
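For example, run from inside the source qube:

```bash
# dom0 prompts you to choose the destination qube.
qvm-copy report.pdf      # copy, keeping the local file
qvm-move old-notes.txt   # same, but removes the local copy afterwards
```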

Two ideas:

  1. Separate the bibliography manager from the underlying source files it tracks.
  2. Use a qrexec service to access the source files in a backend VM (a sketch follows).
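A minimal sketch of idea 2, with made-up names (my.GetSource, the qube names, and the paths are all hypothetical; the qrexec wiring itself is standard Qubes R4.0):

```bash
# Backend qube: /etc/qubes-rpc/my.GetSource
#!/bin/sh
# Read a filename from the caller and stream the file back on stdout.
read -r name
case "$name" in */*|*..*) exit 1 ;; esac   # reject path traversal
exec cat "/home/user/sources/$name"

# dom0 policy: /etc/qubes-rpc/policy/my.GetSource
#   frontend backend allow

# Frontend qube: fetch a file through the service.
echo paper.pdf | qrexec-client-vm backend my.GetSource > paper.pdf
```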

Granted, if one of the threats you need to defend against is people breaking into your place while you’re asleep and physically restraining you while they use liquid nitrogen on your DIMMs as they cart your computer away to a lab, then yes, I suppose it’s important to make sure your drive is encrypted and keys are wiped from memory whenever you’re not awake and alert near the machine. At that point, though, I think physical security starts to trump digital security (i.e., rubber-hose cryptanalysis).

On a related note, encrypt-on-suspend might make it into Qubes at some point.

1 Like

Great ideas! :slightly_smiling_face:

Yep :stuck_out_tongue:

Awesome! I didn’t know about that.

Yeah…that would be great.

Yes.

Agreed.

Did you try to upload, let’s say, a 50GB heavily encrypted blob to Google Drive? Google refused to store videos I produced, and they weren’t even encrypted at the time.

Depends… Let’s say there is an activist or investigative journalist who has already done certain work, and hence is already in the crosshairs; then it might not go well for that person. Keyword: metadata. It has a higher value than content. Keep that in mind.

Yep, I know about that issue.

I don’t think so. Qubes OS is basically nothing other than multiple computers in one, hence it has to be treated like a set of computers, or, traditionally, like a computer network.

A note on the notion of bulk data. Fact-checkers, investigators, and analysts have their own libraries, for example: RSS feeds, mailing lists, articles (downloading articles is necessary because often enough articles simply vanish). Over time one creates a very valuable local archive/library, meaning it’s always available. Just to give you an idea: I have an email collection of over 100k emails, plus articles, slides, videos, etc.; in total almost 150GB, and that’s without my library. So, according to your proposal, I would have all that neatly stored away somewhere. But I use it on a daily basis. In order to have the emails stored away, I would need to maintain an offline, read-only mail server, but then I couldn’t search across all my data, and so on… It’s read-only but active, and it grows daily by around 0.5GB.

You may want to check out https://github.com/tasket/wyng-backup

This was referenced earlier as ‘sparsebak’. Wyng can perform very fast
incremental backups of large volumes, and it can also remove old backups
from the archive quickly. You can also keep pushing backups to it
indefinitely as the next release candidate will monitor disk space and
prune old data automatically to make room.

Wyng does not yet have integrated encryption (which can be currently
handled in dom0 by using an attached LUKS container) and it doesn’t
handle Qubes VM settings for you. I plan to address both these issues in
the next iteration, starting with a wrapper script that handles VM
settings. Since Wyng only handles data as disk blocks, it is also one of
the more secure options available.

1 Like

Yes, many TBs, in fact.

No clue what’s going on there.

Can you be more specific?

Let’s go with your scenario. Suppose there’s an activist or investigative journalist who has published some work that causes some very powerful people to feel threatened. So this person is, as you say, already “in the cross hairs” of some people with the means to do serious harm. Now, suppose that the person is contemplating whether she should store encrypted backups on a cloud storage service, such as Google Drive or AWS. As you point out, metadata is an important consideration. For example, an encrypted Qubes backup can be identified as such by its plaintext header.

Ok, so what’s the additional threat you have in mind here? Is it that her powerful adversaries might be able to ascertain that she uses Qubes and stores Qubes backup in these cloud storage services? We can even suppose that they’re able to access her cloud storage accounts and download copies without her knowledge. Unless her use of Qubes was supposed to remain a secret, it’s unclear what she has lost, since the backups are still encrypted. I’m just trying to understand which specific threats you have in mind and what you mean by your statements.

But you’re talking about properties of some system outside of Qubes that receives and stores Qubes backups. For example, one property you specified was, “Backup allows the user to lock certain backups, which will never be deleted unless the user explicitly says so.” That’s the type of control you find in AWS, for example. It’s not a property you get from a simple external hard drive formatted with FAT32. It’s not something that Qubes can enforce on external devices, such as that FAT32 hard drive, because someone can simply plug that hard drive into a Windows machine and delete whatever they want. Even if the entire drive is LUKS-encrypted, they can simply wipe the drive.

That’s not what I proposed. My proposal is simply that if you have bulk data that changes infrequently, it may be appropriate to store it in VMs that are infrequently backed up. This can cut down on the amount of data that has to be backed up frequently.

More complex setups are also possible. For example, @unman described his setup in which emails are stored in an offline backend VM and accessed by the MUA via qrexec (if I understood correctly). Something like this might be appropriate for you, given the complex setup you have.

You may want to check out GitHub - tasket/wyng-backup: Fast Time Machine-like backups for logical volumes & disk images

This was referenced earlier as ‘sparsebak’. Wyng can perform very fast incremental backups of large volumes, and it can also remove old backups from the archive quickly. You can also keep pushing backups to it indefinitely, as the next release candidate will monitor disk space and prune old data automatically to make room.

Wyng does not yet have integrated encryption (which can currently be handled in dom0 by using an attached LUKS container) and it doesn’t handle Qubes VM settings for you. I plan to address both these issues in the next iteration, starting with a wrapper script that handles VM settings. Since Wyng only handles data as disk blocks, it is also one of the more secure options available.

That sounds interesting…Thanks for the tip.

Yes, many TBs, in fact.

I believe you… I tested it this morning. Not such a big one, but a smaller one. So okay… that works; 15GB cloud storage max. Not too bad, but I still don’t trust Google at all, since it’s known whom they support and how they think about privacy/free expression… but that’s an entirely different discussion.

No clue what’s going on there.

I didn’t expect that you would. :wink: I just gave an example from my own experience.

Can you be more specific?

Phew… that would blow up the forum… Key questions you might ask yourself are:

  • Metadata vs. Content?
  • Ubiquitous surveillance (location data via our phones, call logs, emails, browsing history, contacts, etc. / surveillance capitalism). In one word: the Panopticon.

Recommended reading: Dark Mirror by Barton Gellman; Snowden’s biography; Glenn Greenwald’s No Place to Hide; the Snowden Archive (on theintercept.com); Wikileaks; EDRi.org; EFF.org; and so on.

The thing with all this is that it is abstract and invisible. Metadata is key. The more metadata someone has, the better and more detailed the linkability. Meaning that an adversary can predictively say whether someone might do something adverse to a certain policy, for example. That already happens every day: people vanish, get blackmailed, get badmouthed (reputation murder); black sites around the globe; states that support all this.

Let’s go with your scenario. Suppose there’s an activist or
investigative journalist who has published some work that causes some
very powerful people to feel threatened. So this person is, as you
say, already “in the cross hairs” of some people with the means to do
serious harm. Now, suppose that the person is contemplating whether
she should store encrypted backups on a cloud storage service, such
as Google Drive or AWS. As you point out, metadata is an important
consideration. For example, an encrypted Qubes backup can be
identified as such by its plaintext header.

The simple fact that you use encryption puts you on a list… or even on a watch list. The NSA assigns every person a threat rating. Nobody knows how that works, but they act on it.

Ok, so what’s the additional threat you have in mind here? Is it that
her powerful adversaries might be able to ascertain that she uses
Qubes and stores Qubes backup in these cloud storage services? We can
even suppose that they’re able to access her cloud storage accounts
and download copies without her knowledge. Unless her use of Qubes
was supposed to remain a secret, it’s unclear what she has lost,
since the backups are still encrypted. I’m just trying to understand
which specific threats you have in mind and what you mean by your
statements.

Who uses systems like Qubes OS, encryption, GPG, Signal? Even the simple fact that someone uses open source is enough. In Europe they are again discussing (behind closed doors) how to monitor communication. One proposal, again, is to monitor directly at the source. What does that mean? It means that on every communication device a piece of software (more likely firmware) has to be installed. It grabs your message before you send it and forwards it to a monitoring server, where it gets analyzed (algorithmically, by a human, or both), and, if considered okay, it gets sent to the intended receiver. If interested:

Shortcut to the leaked document:
https://edri.org/wp-content/uploads/2020/09/SKM_C45820090717470-1.pdf

But you’re talking about properties of some system outside of Qubes
that receives and stores Qubes backups. For example, one property you
specified was, “Backup allows the user to lock certain backups, which
will never be deleted unless the user explicitly says so.” That’s the
type of control you find in AWS, for example. It’s not a property you
get from a simple external hard drive formatted with FAT32. It’s not
something that Qubes can enforce on external devices, such as that
FAT32 hard drive, because someone can simply plug that hard drive
into a Windows machine and delete whatever they want. Even if the
entire drive is LUKS-encrypted, they can simply wipe the drive.

Ah, I think here we have a clash of generations. :wink: Nope, you do not need the cloud to have that. For example, software like Timeshift, Back In Time, and even Apple’s Time Machine can do it (argh… not the locking). By locking I mean simply that a backup can’t be deleted. In qubes-backup it could be achieved simply by appending the letter “L” to the archive name. Then, the next time qubes-backup starts, it would skip all archives whose names end in “L”.
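A minimal sketch of that pruning rule, assuming one archive file per backup in a single directory and the trailing-“L” convention above (everything here is hypothetical):

```bash
#!/bin/bash
# Delete oldest archives until there's room, never touching locked ones.
BACKUP_DIR=/mnt/backup
NEEDED_KB=50000000                             # space the next backup needs

avail_kb() { df --output=avail "$BACKUP_DIR" | tail -n 1; }

for archive in $(ls -tr "$BACKUP_DIR"); do     # oldest first (sketch only;
    [ "$(avail_kb)" -ge "$NEEDED_KB" ] && break  # breaks on spaces in names)
    case "$archive" in *L) continue ;; esac    # locked: never delete
    rm -- "$BACKUP_DIR/$archive"
done
```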

That’s not what I proposed. My proposal is simply that if you have
bulk data that changes infrequently, it may be appropriate to store
it in VMs that are infrequently backed up. This can cut down on the
amount of data that has to be backed up frequently.

More complex setups are also possible. For example, @unman described
his setup in which emails are stored in an offline backend VM and
accessed by the MUA via qrexec (if I understood correctly). Something
like this might be appropriate for you, given the complex setup you
have.

That’s it exactly. I don’t have the time to write scripts. Currently I do it only because every N years I replace hardware and revisit my strategy and system/software implementation.
But thanks for the tip. I haven’t checked out qrexec; it’s still on the list, unchecked. :frowning: