Boot time of 25 minutes due to fragmentation of Monero blockchain [solved]

thanks for “systemd-analyze blame”

Is that just looking at boot processes?

I’ve got qubesd.service at 15 minutes.
Everything else over 1 minute is qubes-vm@ services autostarting different VMs.

You might also want to use smartctl/gsmartcontrol to interrogate your drives for errors. Modern drives keep their own internal logs, and the smartctl utility lets you see those logs plus statistics for various parameters, such as the number of reallocated sectors and failure prediction. If one drive is experiencing errors, the whole system could slow down.

https://www.smartmontools.org/wiki/TocDoc
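For instance, something like this should cover the basics (assuming a SATA drive at /dev/sdX; substitute your actual device):

sudo smartctl --all /dev/sdX        # health status, attributes, and the drive’s internal error log
sudo smartctl -t short /dev/sdX     # start a short self-test (takes a couple of minutes)
sudo smartctl -l selftest /dev/sdX  # read the self-test results once it finishes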

Does

qubes-dom0-update --action=upgrade

help?

@tempmail: I keep my system updated with the Qubes Updater, and I know dom0 was updated in the last week or so; wouldn’t that do the same thing?

smartctl --all /dev/sdb

returns:

SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

SMART overall-health self-assessment test result: PASSED

232 Available_Reservd_Space … 004(THRESH) Pre-Fail(TYPE)

241 Host_Writes_GiB … 53712 (RAW_VALUE)
242 Host_Reads_GiB … 115168 (RAW_VALUE)

/dev/sda (other half of RAID 1) gives all the same except:

241 Host_Writes_GiB … 65006 (RAW_VALUE)
242 Host_Reads_GiB … 160737 (RAW_VALUE)

systemd-analyze critical-chain

returns:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @19min 18.870s
└─multi-user.target @19min 18.870s
  └─qubes-vm@myqube.service @15min 54.459s +3min 24.409s
    └─qubes-meminfo-writer-dom0.service @15min 54.431s
      └─qubes-core.service @15min 53.580s +843ms
        └─qubesd.service @5.573s +15min 47.998s
          └─basic.target @5.545s
            └─dbus-broker.service @5.458s +80ms
              └─dbus.socket @5.444s
                └─sysinit.target @5.441s
                  └─systemd-userdbd.service @17min 5.947s +115ms
                    └─systemd-userdbd.socket @1.296s
                      └─system.slice
                        └─-.slice

So it’s taking qubesd.service over 15 minutes to start (completely).

Maybe there is something noticeable in the log?
journalctl -u qubesd.service
You can also plot the boot sequence and timing of all systemd services; maybe you’ll see some other problematic service:
systemd-analyze plot > plot.svg
Then copy plot.svg to some disposable VM and open it in a browser:
qvm-copy-to-vm dispX plot.svg

The plot just seemed to show that qubesd.service took over 15 minutes again.

journalctl -u qubesd.service

gave helpful info, though: on each reboot, major stalls occur around these same lines every time:

May 29 21:30:33 dom0 systemd[1]: Starting Qubes OS daemon...
May 29 21:35:06 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl'
May 29 21:35:09 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z~j3nypgxl' -> '/var/lib/qubes/appvms/crypto/private.img.44@2023-05-30T03:56:38Z'
May 29 21:40:01 dom0 qubesd[1974]: Removed file: '/var/lib/qubes/appvms/crypto/private.img.43@2023-05-22T21:55:43Z'
May 29 21:43:12 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-dirty.img' -> '/var/lib/qubes/appvms/crypto/private.img'
May 29 21:47:51 dom0 qubesd[1974]: Reflinked file: '/var/lib/qubes/appvms/crypto/private.img' -> '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi'
May 29 21:47:54 dom0 qubesd[1974]: Renamed file: '/var/lib/qubes/appvms/crypto/private-precache.img~4z4crrsi' -> '/var/lib/qubes/appvms/crypto/private-precache.img'
May 29 21:47:54 dom0 systemd[1]: Started Qubes OS daemon.

Interestingly, this same qube was related to problems I was having in the past and I had to rebuild it from scratch. Then I noticed this error coming up:

May 29 21:53:13 dom0 qrexec-policy-e[8670]: error calling qrexec-policy-agent in dom0
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 146, in execute
                                                await super().execute(caller_ident)
                                              File "/usr/lib/python3.8/site-packages/qrexec/policy/parser.py", line 643, in execute
                                                raise ExecutionFailed("qrexec-client failed: {}".format(command))
                                            qrexec.exc.ExecutionFailed: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']
                                            
                                            During handling of the above exception, another exception occurred:
                                            
                                            Traceback (most recent call last):
                                              File "/usr/lib/python3.8/site-packages/qrexec/tools/qrexec_policy_exec.py", line 167, in notify
                                                await call_socket_service(guivm, service, source_domain, params)
                                              File "/usr/lib/python3.8/site-packages/qrexec/server.py", line 108, in call_socket_service_local
                                                reader, writer = await asyncio.open_unix_connection(path)
                                              File "/usr/lib64/python3.8/asyncio/streams.py", line 111, in open_unix_connection
                                                transport, _ = await loop.create_unix_connection(
                                              File "/usr/lib64/python3.8/asyncio/unix_events.py", line 244, in create_unix_connection
                                                await self.sock_connect(sock, path)
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 496, in sock_connect
                                                return await fut
                                              File "/usr/lib64/python3.8/asyncio/selector_events.py", line 501, in _sock_connect
                                                sock.connect(address)
                                            ConnectionRefusedError: [Errno 111] Connection refused
May 29 21:53:13 dom0 qrexec[8670]: qubes.WindowIconUpdater: crypto -> dom0: error while executing: qrexec-client failed: ['/usr/lib/qubes/qrexec-client', '-d', 'dom0', '-c', 'SOCKET8,crypto,11', '-E', 'QUBESRPC qubes.WindowIconUpdater+ crypto keyword adminvm']

That made me think the name of the qube, “crypto”, was a reserved word or something. But later on I see these same errors for all qubes that autostart on boot. The only difference is that the “crypto” qube’s error shows up at a different time than the others, which all occur sequentially (within a few seconds).

Maybe related issue:

Yep, looks like the same error.

Though to be clear, those errors are happening after the long delays, so I’m thinking they are not directly related?

You could try disabling autostart and manually starting each qube, without starting the crypto qube.
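Something along these lines (a sketch; “crypto” is the qube name from this thread, and you’d repeat the qvm-prefs line for any other autostarting qubes):

qvm-prefs crypto autostart False   # keep crypto from starting at boot
qvm-start work                     # then start each remaining qube by hand, e.g. a qube named “work”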

Probably a monstrously fragmented volume file; try sudo filefrag /var/lib/qubes/appvms/crypto/private.img to see. Full-node blockchain implementations, and especially monerod, can be terrible!

I’ve been meaning to write up how to defragment file-reflink volumes on Btrfs. I’m not very familiar with XFS, but if you temporarily have enough free disk space in dom0, I guess you could shut down the VM, wait for its private-dirty.img to disappear, delete the @ revision .img files and private-precache.img, then cp --sparse=always --reflink=never private.img to a new file, and finally rename the copy to the original file name.
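Spelled out, that would look roughly like this (just a sketch using the paths from this thread; run it from a dom0 terminal with sudo or a root shell as needed, and make sure you really don’t need the old revisions before deleting them):

qvm-shutdown --wait crypto
cd /var/lib/qubes/appvms/crypto
ls                                               # confirm private-dirty.img has disappeared
sudo rm private.img.*@* private-precache.img     # drop the old @ revision files and the precache copy
sudo cp --sparse=always --reflink=never private.img private.img.new
sudo mv private.img.new private.img              # private.img.new is just an arbitrary temporary name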

Yeah, that’s exactly what this qube is doing, hosting the Monero blockchain.

sudo filefrag /var/lib/qubes/appvms/crypto/private.img

returns:

/var/lib/qubes/appvms/crypto/private.img: 6075181 extents found

OK, so those would be the proper steps to make a new, defragmented copy of that private.img. I will consider trying that later.

sudo xfs_fsr /var/lib/qubes/appvms/crypto/private.img might be a more efficient alternative to the final copy and rename steps:

https://www.man7.org/linux/man-pages/man8/xfs_fsr.8.html
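For example, with the VM shut down, something like this (-v just makes xfs_fsr report what it’s doing):

qvm-shutdown --wait crypto
sudo filefrag /var/lib/qubes/appvms/crypto/private.img   # extent count before
sudo xfs_fsr -v /var/lib/qubes/appvms/crypto/private.img
sudo filefrag /var/lib/qubes/appvms/crypto/private.img   # extent count after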

For comparison, I see ~ 31,000 extents right after defragmenting a Monero VM’s private.img on Btrfs.

Good idea. I did that (turned off autostart for the qube named “crypto”). Interestingly, the first reboot still took over 20 minutes, and checking journalctl -u qubesd.service still showed the same lines causing the delay, even though crypto did not start after boot. But a second reboot seemed to address that, and the OS started in less than 4 minutes.

Then I started crypto manually and it took 4 minutes itself, which is long, but much shorter than the delay it was causing in the boot sequence.

After running that I get 1572484 extents found. That’s about 1/4 of what was there, but still a lot more than yours. The man page describes the time limit in a way that makes me think running it again could continue to improve it, so I will try that.

EDIT: ran it a second time; it ran for a much shorter time and got down to 18167 extents found.

Thanks all, this has been solved. In summary, this was a qube hosting the Monero blockchain, which is apparently prone to high fragmentation, so the solution was to defragment that qube’s volume.

I tracked it down starting with @flavio’s suggestion to use systemd-analyze blame, which blamed qubesd.

Then I used @disp6252’s suggestion of journalctl -u qubesd.service to see in more detail that the delays occurred while qubesd was moving around *.img files related to the qube “crypto”.

@renehoj’s suggestion to not autostart that one qube actually fixed my long system boot time.

But the real solution was to defrag the qube. And since I’m using XFS as the filesystem, @rustybird’s suggestion solved the problem:

sudo xfs_fsr /var/lib/qubes/appvms/crypto/private.img

I ran this on one other qube that seemed highly fragmented, and can now boot in under 3 minutes.
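In case it’s useful to anyone else: filefrag accepts multiple files, so something like this gives a quick per-qube extent count for spotting other fragmented volumes:

sudo filefrag /var/lib/qubes/appvms/*/private.img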

Thanks to all!

I’m glad you managed to resolve the issue! And it’s a good learning experience too!

Would this strategy with cp work for btrfs as well? Thanks!

It probably would, but instead of the final cp and rename steps I use btrfs filesystem defragment -f private.img (like xfs_fsr private.img for XFS).
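So on Btrfs the whole thing would look roughly like this (a sketch; <vm> is a placeholder for the qube name, and note that defragmenting can break reflink sharing with the @ revision copies, so it may temporarily use more disk space):

qvm-shutdown --wait <vm>
sudo btrfs filesystem defragment -f /var/lib/qubes/appvms/<vm>/private.img
sudo filefrag /var/lib/qubes/appvms/<vm>/private.img   # check the extent count afterwards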
