Boot time of 25 minutes due to fragmentation of Monero blockchain [solved]

Using a librebooted Asus D-16

Boot times with Qubes have always been slow for me, but recently it has taken 25 minutes to get to the login screen! Combined with the typical monthly crash in the middle of working, that can really mess up a day. I’ve also always had issues getting USB devices to work, which sometimes requires 3 reboots. So that can mean 1.5 hours to get my system back up, as happened to me yesterday.

The USB issues date back to Qubes 3.2, so I’ve just come to live with them. I’ve done a few fresh installs of Qubes in the last year or two. The system is LUKS-encrypted RAID1 on XFS, as XFS seemed to be the only filesystem that would install as an encrypted RAID1 using Qubes 4.1 (see my previous threads).

If anyone has any tips on improving or debugging this, I’d appreciate it.

I did not see you say anything about the spindle or cache size/speed of the drives you are using. With software RAID 1 your performance is determined by the slowest drive, everything is written twice, and then there’s the software encryption on top.

Use the fastest drives you have. I originally had three spinning-rust drives in my machine, and the long boot times were really trying my patience. I switched to SSDs and never looked back. What a difference! If you have the resources to spare, look into trying some SSDs. You can temporarily remove your current drives and try it, and if it doesn’t work to your satisfaction, go back to the old drives. If it works, you can use the old drives for nightly backups instead.

If you don’t have a dedicated hardware RAID controller card for the RAID/encryption, then look into some hardware to speed that part up.
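Before buying hardware, it may be worth measuring how fast dom0 can do the crypto in software; cryptsetup ships a benchmark, and /proc/cpuinfo shows whether the CPU advertises AES acceleration (a quick sketch, run in a dom0 terminal):

grep -m1 -o aes /proc/cpuinfo   # prints "aes" if the CPU has AES-NI
cryptsetup benchmark            # LUKS cipher throughput in software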

The RAID is on two WD Blue SSDs.
It is a software RAID.

Check your disk read/write speed.
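One rough way to measure that from dom0, assuming /dev/sda is one of your RAID members (direct I/O bypasses the page cache so the number reflects the drive itself):

sudo dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct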

I’ve seen slow boot times whenever ACPI errors were produced and the boot process needed to timeout from those errors. In my particular case, the fix was quite simple: disable SMT in the BIOS/UEFI. Did you look at the boot logs to see what accounts for most of the boot time? “systemd-analyze blame” or similar can help you pinpoint the culprit(s).
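For example, in a dom0 terminal (the head filter is just to keep the output short):

systemd-analyze blame | head -n 10   # slowest-starting units first
systemd-analyze critical-chain       # the dependency chain that gated boot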

thanks for “systemd-analyze blame”

Is that just looking at boot processes?

I’ve got qubesd.service at 15 minutes;
everything else over 1 minute is a qubes-vm@ service autostarting a different VM.

You might also want to use smartctl/gsmartcontrol to interrogate your drives for errors. Modern drives keep their own logs internally, and the smartctl utility lets you see those logs and statistics on various parameters, such as the number of reallocated sectors and failure prediction. If one drive is experiencing errors, the whole system can slow down.

https://www.smartmontools.org/wiki/TocDoc
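For example, from a dom0 terminal (device names are illustrative; substitute your RAID members):

sudo smartctl -a /dev/sda        # full report: health, attributes, internal logs
sudo smartctl -l error /dev/sda  # just the drive’s error log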

Does

qubes-dom0-update --action=upgrade

help?

@tempmail : I keep my system updated with the Qubes Updater and I know dom0 was updated in the last week or so, wouldn’t that do the same thing?

sudo smartctl -a /dev/sdb

returns:

SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

SMART overall-health self-assessment test result: PASSED

232 Available_Reservd_Space … 004(THRESH) Pre-Fail(TYPE)

241 Host_Writes_GiB … 53712 (RAW_VALUE)
242 Host_Reads_GiB … 115168 (RAW_VALUE)

/dev/sda (other half of RAID 1) gives all the same except:

241 Host_Writes_GiB … 65006 (RAW_VALUE)
242 Host_Reads_GiB … 160737 (RAW_VALUE)

systemd-analyze critical-chain

returns:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @19min 18.870s
└─multi-user.target @19min 18.870s
  └─qubes-vm@myqube.service @15min 54.459s +3min 24.409s
    └─qubes-meminfo-writer-dom0.service @15min 54.431s
      └─qubes-core.service @15min 53.580s +843ms
        └─qubesd.service @5.573s +15min 47.998s
          └─basic.target @5.545s
            └─dbus-broker.service @5.458s +80ms
              └─dbus.socket @5.444s
                └─sysinit.target @5.441s
                  └─systemd-userdbd.service @17min 5.947s +115ms
                    └─systemd-userdbd.socket @1.296s
                      └─system.slice
                        └─-.slice

So it’s taking qubesd.service 15 minutes to start (completely)

Maybe there is something noticeable in the log?
journalctl -u qubesd.service
You can also plot the boot sequence and timing of all systemd services; maybe you’ll spot some other problematic service:
systemd-analyze plot > plot.svg
Then copy plot.svg to some disposable VM and open it in a browser:
qvm-copy-to-vm dispX plot.svg

The plot just seemed to show that qubesd.service took over 15 minutes again.

journalctl -u qubesd.service

gave helpful info though: on each reboot, major stalls occur around these same lines:

May 29 21:30:33 dom0 systemd[1]: Starting Qubes OS daemon...
May 29 21:35:06 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl'
May 29 21:35:09 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z'
May 29 21:40:01 dom0 qubesd[1974]: Removed file: '/var/lib/qubes/appvms/crypto/private.img.43@2023-05-22T21:55:43Z'
May 29 21:43:12 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-dirty.img' -> '/var/lib/qubes/appvms/crypto/private.img'
May 29 21:47:51 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi'
May 29 21:47:54 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi' -> '/var/lib/qubes/appvms/crypto/private-precache.img'
May 29 21:47:54 dom0 systemd[1]: Started Qubes OS daemon.

Interestingly, this same qube was related to problems I was having in the past and I had to rebuild it from scratch. Then I noticed this error coming up:

May 29 21:53:13 dom0 qrexec-policy-e[8670]: error calling qrexec-policy-agent in dom0
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 146, in execute
                                                await super().execute(caller_ident)
                                              File "/usr/lib/python3.8/site-packages/qrexec/policy/parser.py", line 643, in execute
                                                raise ExecutionFailed("qrexec-client failed: {}".format(command))
                                            qrexec.exc.ExecutionFailed: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']
                                            
                                            During handling of the above exception, another exception occurred:
                                            
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 167, in notify
                                                await call_socket_service(guivm, service, source_domain, params)
                                              File "/usr/lib/python3.8/site-packages/qrexec/server.py", line 108, in call_socket_service_local
                                                reader, writer = await asyncio.open_unix_connection(path)
                                              File "/usr/lib64/python3.8/asyncio/streams.py", line 111, in open_unix_connection
                                                transport, _ = await loop.create_unix_connection(
                                              File "/usr/lib64/python3.8/asyncio/unix_events.py", line 244, in create_unix_connection
                                                await self.sock_connect(sock, path)
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 496, in sock_connect
                                                return await fut
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 501, in _sock_connect
                                                sock.connect(address)
                                            ConnectionRefusedError: [Errno 111] Connection refused
May 29 21:53:13 dom0 qrexec[8670]: qubes.WindowIconUpdater: crypto -> dom0: error while executing: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']

This made me think the name of the qube, “crypto”, was a reserved word or something. But later on I see the same errors for all qubes that autostart on boot. The only difference is that the “crypto” qube’s error shows up at a different time than the others, which all occur sequentially (within a few seconds).

Maybe related issue:

Yep, looks like the same error.

Though to be clear, those errors are happening after the long delays, so I’m thinking they are not directly related?

You could try disabling autostart and manually starting each qube, without starting the crypto qube.
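One way to do that from dom0 (qvm-prefs controls the autostart property; “crypto” is the qube name from this thread):

qvm-prefs crypto autostart False   # keep the qube out of the boot sequence
qvm-start crypto                   # later, start it by hand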

Probably a monstrously fragmented volume file; try sudo filefrag /var/lib/qubes/appvms/crypto/private.img to see. Full-node blockchain implementations, and especially monerod, can be terrible!

I’ve been meaning to write up how to defragment file-reflink volumes on Btrfs. I’m not very familiar with XFS, but if you temporarily have enough free disk space in dom0, I guess you could:

1. shut down the VM and wait for its private-dirty.img to disappear,
2. delete the @ revision .img files and private-precache.img,
3. cp --sparse=always --reflink=never private.img to a new file,
4. then rename the copy to the original file name.
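A rough sketch of those steps in dom0 (paths from this thread; the private.img.defrag name is made up for illustration, and this assumes the VM is fully shut down and the @ revision/precache files are safe to delete):

cd /var/lib/qubes/appvms/crypto
sudo rm private.img.*@* private-precache.img   # drop old revisions and the precache copy
sudo cp --sparse=always --reflink=never private.img private.img.defrag   # full copy allocates fresh extents
sudo mv private.img.defrag private.img         # swap the defragmented copy into place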


Yeah, that’s exactly what this qube is doing, the Monero blockchain.

sudo filefrag /var/lib/qubes/appvms/crypto/private.img

returns:

/var/lib/qubes/appvms/crypto/private.img: 6075181 extents found

Ok, so those would be the proper steps to make a new copy of that private.img, and then it would be defragmented. I will consider trying that later.

sudo xfs_fsr /var/lib/qubes/appvms/crypto/private.img might be a more efficient alternative to the final copy and rename steps:

https://www.man7.org/linux/man-pages/man8/xfs_fsr.8.html
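For example, with the VM shut down (-v just prints progress; the filefrag re-check verifies the extent count actually dropped):

sudo xfs_fsr -v /var/lib/qubes/appvms/crypto/private.img
sudo filefrag /var/lib/qubes/appvms/crypto/private.img   # re-check fragmentation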

For comparison, I see ~ 31,000 extents right after defragmenting a Monero VM’s private.img on Btrfs.

Good idea. I did that (turned off autostart for the qube named “crypto”). Interestingly, the first reboot still took over 20 minutes, and checking journalctl -u qubesd.service still showed the same lines causing the delay even though crypto did not start after boot. But a second reboot seemed to address that, and the OS started in less than 4 minutes.

Then I started crypto manually and it took 4 minutes itself, which is long, but much shorter than the delay it was causing in the boot sequence.
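If you want to see where those remaining 4 minutes go, the same tools from earlier in the thread apply:

systemd-analyze blame | head -n 5
systemd-analyze critical-chain qubesd.service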