Boot time of 25 minutes due to fragmentation of Monero blockchain [solved]

Using a librebooted Asus D-16

Boot times with Qubes have always been slow for me, but recently it has taken 25 minutes to get to the login screen! Combined with the typical monthly crash in the middle of working, that can really mess up a day. I’ve also always had issues getting USB devices to work, which sometimes requires 3 reboots. So that can mean 1.5 hours to get my system back up, as happened to me yesterday.

The USB issues date back to Qubes 3.2, so I’ve just come to live with them. I’ve done a few fresh installs of Qubes in the last year or two. The system is LUKS-encrypted RAID1 on XFS, as XFS seemed to be the only filesystem that would install as an encrypted RAID1 using Qubes 4.1 (see my previous threads).

If anyone has any tips on improving or debugging this, I’d appreciate it.

I did not see you say anything about the spindle or cache size/speed of the drives you are using. With software RAID 1 your performance is determined by the slowest drive, everything is written twice, and then there’s the software encryption on top.

Use the fastest drives you have. I originally had three spinning-rust drives in my machine, and the long boot times were really trying my patience. I switched to SSDs and never looked back. What a difference! If you have the resources to spare, look into trying some SSDs. You can temporarily remove your current drives and try it, and if it doesn’t work to your satisfaction, go back to the old drives. If it works, you can use the old drives for nightly backups instead.

If you don’t have a dedicated hardware RAID controller card for the RAID/encryption, then look into some hardware to speed that part up.
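Before buying hardware, it may be worth measuring how fast dom0 can do the crypto in software; cryptsetup ships a benchmark, and /proc/cpuinfo shows whether the CPU advertises AES acceleration (a quick sketch, run in a dom0 terminal):

grep -m1 -o aes /proc/cpuinfo   # prints "aes" if the CPU has AES-NI
cryptsetup benchmark            # LUKS cipher throughput in software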

The RAID is on two WD Blue SSDs.
It is a software RAID.

Check your disk read/write speed.
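One rough way to measure that from dom0, assuming /dev/sda is one of your RAID members (direct I/O bypasses the page cache so the number reflects the drive itself):

sudo dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct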

I’ve seen slow boot times whenever ACPI errors were produced and the boot process needed to timeout from those errors. In my particular case, the fix was quite simple: disable SMT in the BIOS/UEFI. Did you look at the boot logs to see what accounts for most of the boot time? “systemd-analyze blame” or similar can help you pinpoint the culprit(s).
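For example, in a dom0 terminal (the head filter is just to keep the output short):

systemd-analyze blame | head -n 10   # slowest-starting units first
systemd-analyze critical-chain       # the dependency chain that gated boot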

thanks for “systemd-analyze blame”

Is that just looking at boot processes?

I’ve got qubesd.service at 15 minutes;
everything else over 1 minute is a qubes-vm@ service autostarting a different VM.

You might also want to use smartctl/gsmartcontrol to interrogate your drives for errors. Modern drives keep their own logs internally, and the smartctl utility lets you see those logs and statistics on various parameters, such as the number of reallocated sectors and failure prediction. If one drive is experiencing errors, the whole system can slow down.

https://www.smartmontools.org/wiki/TocDoc
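For example, from a dom0 terminal (device names are illustrative; substitute your RAID members):

sudo smartctl -a /dev/sda        # full report: health, attributes, internal logs
sudo smartctl -l error /dev/sda  # just the drive’s error log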

Does

qubes-dom0-update --action=upgrade

help?

@tempmail : I keep my system updated with the Qubes Updater and I know dom0 was updated in the last week or so, wouldn’t that do the same thing?

sudo smartctl -a /dev/sdb

returns:

SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

SMART overall-health self-assessment test result: PASSED

232 Available_Reservd_Space … 004(THRESH) Pre-Fail(TYPE)

241 Host_Writes_GiB … 53712 (RAW_VALUE)
242 Host_Reads_GiB … 115168 (RAW_VALUE)

/dev/sda (other half of RAID 1) gives all the same except:

241 Host_Writes_GiB … 65006 (RAW_VALUE)
242 Host_Reads_GiB … 160737 (RAW_VALUE)

systemd-analyze critical-chain

returns:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @19min 18.870s
└─multi-user.target @19min 18.870s
  └─qubes-vm@myqube.service @15min 54.459s +3min 24.409s
    └─qubes-meminfo-writer-dom0.service @15min 54.431s
      └─qubes-core.service @15min 53.580s +843ms
        └─qubesd.service @5.573s +15min 47.998s
          └─basic.target @5.545s
            └─dbus-broker.service @5.458s +80ms
              └─dbus.socket @5.444s
                └─sysinit.target @5.441s
                  └─systemd-userdbd.service @17min 5.947s +115ms
                    └─systemd-userdbd.socket @1.296s
                      └─system.slice
                        └─-.slice

So it’s taking qubesd.service 15 minutes to start (completely)

Maybe there is something noticeable in the log?
journalctl -u qubesd.service
You can also plot the boot sequence and timing of all systemd services; maybe you’ll spot some other problematic service:
systemd-analyze plot > plot.svg
Then copy plot.svg to some disposable VM and open it in a browser:
qvm-copy-to-vm dispX plot.svg

The plot just seemed to show that qubesd.service took over 15 minutes again.

journalctl -u qubesd.service

gave helpful info though: on each reboot, major stalls occur around these same lines:

May 29 21:30:33 dom0 systemd[1]: Starting Qubes OS daemon...
May 29 21:35:06 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl'
May 29 21:35:09 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z'
May 29 21:40:01 dom0 qubesd[1974]: Removed file: '/var/lib/qubes/appvms/crypto/private.img.43@2023-05-22T21:55:43Z'
May 29 21:43:12 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-dirty.img' -> '/var/lib/qubes/appvms/crypto/private.img'
May 29 21:47:51 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi'
May 29 21:47:54 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi' -> '/var/lib/qubes/appvms/crypto/private-precache.img'
May 29 21:47:54 dom0 systemd[1]: Started Qubes OS daemon.

Interestingly, this same qube was related to problems I was having in the past and I had to rebuild it from scratch. Then I noticed this error coming up:

May 29 21:53:13 dom0 qrexec-policy-e[8670]: error calling qrexec-policy-agent in dom0
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 146, in execute
                                                await super().execute(caller_ident)
                                              File "/usr/lib/python3.8/site-packages/qrexec/policy/parser.py", line 643, in execute
                                                raise ExecutionFailed("qrexec-client failed: {}".format(command))
                                            qrexec.exc.ExecutionFailed: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']
                                            
                                            During handling of the above exception, another exception occurred:
                                            
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 167, in notify
                                                await call_socket_service(guivm, service, source_domain, params)
                                              File "/usr/lib/python3.8/site-packages/qrexec/server.py", line 108, in call_socket_service_local
                                                reader, writer = await asyncio.open_unix_connection(path)
                                              File "/usr/lib64/python3.8/asyncio/streams.py", line 111, in open_unix_connection
                                                transport, _ = await loop.create_unix_connection(
                                              File "/usr/lib64/python3.8/asyncio/unix_events.py", line 244, in create_unix_connection
                                                await self.sock_connect(sock, path)
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 496, in sock_connect
                                                return await fut
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 501, in _sock_connect
                                                sock.connect(address)
                                            ConnectionRefusedError: [Errno 111] Connection refused
May 29 21:53:13 dom0 qrexec[8670]: qubes.WindowIconUpdater: crypto -> dom0: error while executing: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']

This made me think the name of the qube, “crypto”, was a reserved word or something. But later on I see the same errors for all qubes that autostart on boot. The only difference is that the “crypto” qube’s error shows up at a different time than the others, which all occur sequentially (within a few seconds).

Maybe related issue:

Yep, looks like the same error.

Though to be clear, those errors are happening after the long delays, so I’m thinking they are not directly related?

You could try disabling autostart and manually starting each qube, without starting the crypto qube.
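One way to do that from dom0 (qvm-prefs controls the autostart property; “crypto” is the qube name from this thread):

qvm-prefs crypto autostart False   # keep the qube out of the boot sequence
qvm-start crypto                   # later, start it by hand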

Probably a monstrously fragmented volume file; try sudo filefrag /var/lib/qubes/appvms/crypto/private.img to see. Full-node blockchain implementations, and especially monerod, can be terrible!

I’ve been meaning to write up how to defragment file-reflink volumes on Btrfs. I’m not very familiar with XFS, but if you temporarily have enough free disk space in dom0, I guess you could:

1. shut down the VM and wait for its private-dirty.img to disappear,
2. delete the @ revision .img files and private-precache.img,
3. cp --sparse=always --reflink=never private.img to a new file,
4. then rename the copy to the original file name.
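A rough sketch of those steps in dom0 (paths from this thread; the private.img.defrag name is made up for illustration, and this assumes the VM is fully shut down and the @ revision/precache files are safe to delete):

cd /var/lib/qubes/appvms/crypto
sudo rm private.img.*@* private-precache.img   # drop old revisions and the precache copy
sudo cp --sparse=always --reflink=never private.img private.img.defrag   # full copy allocates fresh extents
sudo mv private.img.defrag private.img         # swap the defragmented copy into place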


Yeah, that’s exactly what this qube is doing, the Monero blockchain.

sudo filefrag /var/lib/qubes/appvms/crypto/private.img

returns:

/var/lib/qubes/appvms/crypto/private.img: 6075181 extents found

Ok, so those would be the proper steps to make a new copy of that private.img, and then it would be defragmented. I will consider trying that later.

sudo xfs_fsr /var/lib/qubes/appvms/crypto/private.img might be a more efficient alternative to the final copy and rename steps:

https://www.man7.org/linux/man-pages/man8/xfs_fsr.8.html
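For example, with the VM shut down (-v just prints progress; the filefrag re-check verifies the extent count actually dropped):

sudo xfs_fsr -v /var/lib/qubes/appvms/crypto/private.img
sudo filefrag /var/lib/qubes/appvms/crypto/private.img   # re-check fragmentation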

For comparison, I see ~ 31,000 extents right after defragmenting a Monero VM’s private.img on Btrfs.

Good idea. I did that (turned off autostart for the qube named “crypto”). Interestingly, the first reboot still took over 20 minutes, and checking journalctl -u qubesd.service still showed the same lines causing the delay even though crypto did not start after boot. But a second reboot seemed to address that, and the OS started in less than 4 minutes.

Then I started crypto manually and it took 4 minutes itself, which is long, but much shorter than the delay it was causing in the boot sequence.
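If you want to see where those remaining 4 minutes go, the same tools from earlier in the thread apply:

systemd-analyze blame | head -n 5
systemd-analyze critical-chain qubesd.service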