Analysis of a disk issue

TL;DR I didn’t get the 4.2.3 update because my dom0 crond daemon creates a lot of core dumps. Why? Just a basic disk issue :frowning:

No 4.2.3 update

I was surprised not to see the 4.2.3 update. I checked, and qubes-release-4.2-9.fc37.src.rpm has been published in the current repository. So I searched for why I didn’t get the update.
/etc/cron.daily/qubes-dom0-updates.cron should check for dom0 updates: it runs qubes-dom0-update --check-only. I checked, and qubes-dom0-update --check-only does find the qubes-release update:

[user@dom0 ~]$ qubes-dom0-update --check-only
...
qubes-release.noarch                         4.2-9.fc37               qubes-dom0-current
...

So why doesn’t /etc/cron.daily/qubes-dom0-updates.cron work?
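A quick sanity check is to run the cron script by hand, outside of cron, to rule out a problem in the script itself (assuming it is a plain shell script):

[user@dom0 ~]$ sudo bash /etc/cron.daily/qubes-dom0-updates.cron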

Crond creates core dumps

The crond daemon runs the /etc/cron.daily/ tasks. In the logs I see that the daemon creates a lot of core dumps:

[user@dom0 ~]$ journalctl -fu crond
Sep 29 10:44:37 dom0 systemd[1]: crond.service: Scheduled restart job, restart counter is at 189.
Sep 29 10:44:37 dom0 systemd[1]: Stopped crond.service - Command Scheduler.
Sep 29 10:44:37 dom0 systemd[1]: Started crond.service - Command Scheduler.
Sep 29 10:44:37 dom0 systemd[1]: crond.service: Main process exited, code=dumped, status=7/BUS
Sep 29 10:44:37 dom0 systemd[1]: crond.service: Failed with result 'core-dump'.
Sep 29 10:45:08 dom0 systemd[1]: crond.service: Scheduled restart job, restart counter is at 190.
Sep 29 10:45:08 dom0 systemd[1]: Stopped crond.service - Command Scheduler.
Sep 29 10:45:08 dom0 systemd[1]: Started crond.service - Command Scheduler.
Sep 29 10:45:08 dom0 systemd[1]: crond.service: Main process exited, code=dumped, status=7/BUS
Sep 29 10:45:08 dom0 systemd[1]: crond.service: Failed with result 'core-dump'.
Sep 29 10:45:38 dom0 systemd[1]: crond.service: Scheduled restart job, restart counter is at 191.
Sep 29 10:45:38 dom0 systemd[1]: Stopped crond.service - Command Scheduler.
Sep 29 10:45:38 dom0 systemd[1]: Started crond.service - Command Scheduler.
Sep 29 10:45:38 dom0 systemd[1]: crond.service: Main process exited, code=dumped, status=7/BUS
Sep 29 10:45:38 dom0 systemd[1]: crond.service: Failed with result 'core-dump'.
^C
[user@dom0 ~]$ systemctl status crond
● crond.service - Command Scheduler
     Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: core-dump) since Sun 2024-09-29 10:45:38 UTC; 13s ago
    Process: 17486 ExecStart=/usr/sbin/crond -n $CRONDARGS (code=dumped, signal=BUS)
   Main PID: 17486 (code=dumped, signal=BUS)
        CPU: 10ms

Sep 29 10:45:38 dom0 systemd[1]: crond.service: Failed with result 'core-dump'.

Core dump frequency and content

The coredumpctl list command shows all the core dumps, and there is a new crond core dump every 30 seconds:

[user@dom0 ~]$ sudo coredumpctl list
...
Sun 2024-09-29 13:13:22 UTC 21768    0    0 SIGBUS  present  /usr/sbin/crond      43.2K
Sun 2024-09-29 13:13:52 UTC 21777    0    0 SIGBUS  present  /usr/sbin/crond      43.2K
Sun 2024-09-29 13:14:23 UTC 21789    0    0 SIGBUS  present  /usr/sbin/crond      43.2K
Sun 2024-09-29 13:14:53 UTC 21806    0    0 SIGBUS  present  /usr/sbin/crond      43.2K
Sun 2024-09-29 13:15:23 UTC 21830    0    0 SIGBUS  present  /usr/sbin/crond      43.3K

The coredumpctl info command shows the last core dump:

[user@dom0 ~]$ sudo coredumpctl info
          PID: 21950 (crond)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 7 (BUS)
     Timestamp: Sun 2024-09-29 13:20:25 UTC (29s ago)
  Command Line: /usr/sbin/crond -n
    Executable: /usr/sbin/crond
 Control Group: /system.slice/crond.service
          Unit: crond.service
         Slice: system.slice
       Boot ID: 58cd9555d8b64bf495f4359ef6306bfc
    Machine ID: 38aab07ddc4b4ed488f3f81290b88c77
      Hostname: dom0
       Storage: /var/lib/systemd/coredump/core.crond.0.58cd9555d8b64bf495f4359ef6306bfc.21950.1727616025000000.>
     Disk Size: 43.2K
       Package: cronie/1.6.1-3.fc37
      build-id: 83da550dea1764b03a0f041494e9c1df758ce1ca
       Message: Process 21950 (crond) of user 0 dumped core.
                
                Module linux-vdso.so.1 with build-id bb05304aa2fa20cc6c55d8d7fc5dda2a83ad8aac

....

               Stack trace of thread 21950:
                #0  0x00005b0d124da000 _init (crond + 0x3000)
                ELF object binary architecture: AMD x86-64

The core dump contains only a short stack trace, and the crash happens in the _init() function. So maybe it is a problem with my crond configuration files… I will look in that direction.
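As a side note, coredumpctl can hand the core dump straight to gdb for a closer look at the crash (useful symbols would require the cronie debuginfo package in dom0):

[user@dom0 ~]$ sudo coredumpctl gdb crond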

crond binary origin

The /usr/sbin/crond executable comes from the cronie-1.6.1-3.fc37 package, which is the latest version available in Fedora 37.
I checked the file checksums of the cronie package:

[user@dom0 ~]$ sudo rpm -V cronie
..?......    /usr/sbin/crond

The rpm man page says:

a single “?” (question mark) indicates the test could not be performed (e.g. file permissions prevent reading)

So I can’t conclude anything from that :frowning:
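For reference, the digest that rpm recorded at install time can still be printed from its database (the --dump output includes each file’s digest), which gives a value to compare against once the file is readable again:

[user@dom0 ~]$ rpm -ql --dump cronie | grep /usr/sbin/crond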

What is the binary checksum?

[user@dom0 ~]$ ls -al /usr/sbin/crond
-rwxr-xr-x 1 root root 75072 Jul 21  2022 /usr/sbin/crond
[user@dom0 ~]$ sudo sha256sum /usr/sbin/crond
sha256sum: /usr/sbin/crond: Input/output error

Why can’t sha256sum read the file? :thinking: I get the same problem with md5sum.

A disk issue

The file command works:

[user@dom0 ~]$ file /usr/sbin/crond
/usr/sbin/crond: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=83da550dea1764b03a0f041494e9c1df758ce1ca, for GNU/Linux 3.2.0, stripped

The file command reads only the header of the file, whereas sha256sum reads the whole file.
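To make the difference concrete, dd can read just the first block and then the whole file; the first read should succeed, while the full read should fail with the same Input/output error as sha256sum (a sketch, output not shown):

[user@dom0 ~]$ dd if=/usr/sbin/crond of=/dev/null bs=4k count=1
[user@dom0 ~]$ dd if=/usr/sbin/crond of=/dev/null bs=4k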

So is it an issue with my filesystem?

[user@dom0 ~]$ sudo dmesg -T
...
[Sun Sep 29 13:40:05 2024] nvme0n1: I/O Cmd(0x2) @ LBA 7792688, 8 blocks, I/O Error (sct 0x2 / sc 0x81) 
[Sun Sep 29 13:40:05 2024] critical medium error, dev nvme0n1, sector 7792688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
...

Arghhh my SSD is dying…

But the good news is that I make frequent 3-2-1 backups. So now I will try to fix the disk issue; otherwise I will buy a new SSD and restore from my backups.


dmesg -Tw always shows the error on the same sector (7792688), because crond tries to start every 30 seconds. So maybe my disk only has a few bad sectors…
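To confirm that the error really comes from that spot on the device and not from anything crond does, the failing range can be read directly with dd (dmesg sector numbers are 512-byte units, and iflag=direct bypasses the page cache; this should reproduce the same critical medium error):

[user@dom0 ~]$ sudo dd if=/dev/nvme0n1 of=/dev/null bs=512 skip=7792688 count=8 iflag=direct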

So I ran a smartctl analysis:

[user@dom0 ~]$ sudo smartctl -a /dev/nvme0n1
...
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    99%
Available Spare Threshold:          5%
Percentage Used:                    1%
Data Units Read:                    39,868,786 [20.4 TB]
Data Units Written:                 13,268,237 [6.79 TB]
Host Read Commands:                 303,385,584
Host Write Commands:                131,954,715
Controller Busy Time:               3,717
Power Cycles:                       411
Power On Hours:                     188
Unsafe Shutdowns:                   40
Media and Data Integrity Errors:    251,677
Error Information Log Entries:      368
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0        368     0  0x0010  0x8004  0x000            0     0     -  Invalid Field in Command

Read Self-test Log failed: Invalid Field in Command (0x002)

My disk has 99% available spare, so maybe I can mark the failed sectors as not-to-be-used and keep using my SSD. But it is the beginning of the SSD’s end of life.


Wow! You should let us know which brand that is, so we can avoid it!

This sucks. Reminds me of my recent incident:

I have not replaced that SSD yet, but I removed as much data as possible (specifically the Windows 11 template) to extend its life, and it is working fine for the time being. The new model I ordered has some DRAM and a much better rated lifespan.

From smartctl:

Model Number:                       CT1000P5SSD8

Yes, I remembered your SSD issue when I wrote this topic.

The solution

I booted a rescue OS (Clonezilla) from a USB drive and opened the LUKS block device:

cryptsetup luksOpen /dev/nvme0n1p3 xxx
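Depending on the rescue environment, the LVM volume group may also need to be activated before /dev/qubes_dom0/root appears, for example:

vgchange -ay qubes_dom0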

Then I checked the dom0 filesystem for bad blocks:

fsck -vcck /dev/qubes_dom0/root

which found only one bad block, in the /usr/sbin/crond file, and added it to the filesystem’s bad-block list (with e2fsck, -cc runs a non-destructive read-write badblocks scan, -k keeps any previously recorded bad blocks, and -v is verbose).
Then I started Qubes OS; as expected, the crond daemon kept failing. So I reinstalled the cronie package:

[user@dom0 ~]$ sudo qubes-dom0-update --action=reinstall cronie

and crond started correctly:

[user@dom0 ~]$ journalctl -fu crond
Sep 29 18:19:34 dom0 systemd[1]: Started crond.service - Command Scheduler.
Sep 29 18:19:34 dom0 systemd[1]: crond.service: Main process exited, code=dumped, status=11/SEGV
Sep 29 18:19:34 dom0 systemd[1]: crond.service: Failed with result 'core-dump'.
Sep 29 18:20:04 dom0 systemd[1]: crond.service: Scheduled restart job, restart counter is at 9.
Sep 29 18:20:04 dom0 systemd[1]: Stopped crond.service - Command Scheduler.
Sep 29 18:20:04 dom0 systemd[1]: Started crond.service - Command Scheduler.
Sep 29 18:20:04 dom0 crond[6200]: (CRON) STARTUP (1.6.1)
Sep 29 18:20:04 dom0 crond[6200]: (CRON) INFO (Syslog will be used instead of sendmail.)
Sep 29 18:20:04 dom0 crond[6200]: (CRON) INFO (RANDOM_DELAY will be scaled with factor 95% if used.)
Sep 29 18:20:04 dom0 crond[6200]: (CRON) INFO (running with inotify support)

So the problem is solved, but from now on I will regularly check my filesystem for such issues. And my spare SSD is ready to replace the faulty one.

I will also check my other LVM volumes…
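Probably with a read-only surface scan of each volume from the rescue OS, along these lines (<volume> is just a placeholder for each path printed by lvs):

lvs --noheadings -o lv_path
badblocks -sv /dev/qubes_dom0/<volume>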


I have been scheduling long SMART self-tests on a weekly basis during the night (roughly as sketched below). Overall, all such incidents suck. SSDs were supposed to save us from many of the faults we experienced with spinning rust. Your SSD is a recent 1 TB model from a well-reputed brand that most of us would consider trustworthy. How much free space did it have at the time of the incident, may I ask?
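For what it’s worth, the schedule is just a cron entry along these lines (device path and timing are only an example, and the drive has to support the optional NVMe self-test command):

# /etc/cron.d/smart-selftest: long self-test every Sunday at 03:00
0 3 * * 0 root /usr/sbin/smartctl -t long /dev/nvme0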


Far from being full. The disk widget displays:

Total disk usage 35%
Volumes
  linux-kernel
  varlibqubes 
    data     45.4%     8.9 GiB / 19.5 GiB
  vm-pool
    data     35.5%   289.3 GiB / 815.1 GiB
    metadata 40.0%

The Qubes OS Update tool detected the 4.2.3 update and installed it:

[user@dom0 ~]$ grep ^P /etc/os-release 
PRETTY_NAME="Qubes OS 4.2.3 (R4.2)"