One of my SSDs in RAID-1 (mirror) has died and I’m trying to boot.
The dead drive was the primary one (with the EFI and boot partitions). However, I had its image, so I copied the EFI and boot partitions to the same place on the second (and now only remaining) drive, including all files, labels, UUIDs, flags, positions on the drive, and sizes.
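For reference, the copy can be done roughly like this from the saved image (the device names, partition numbers, and image path below are only an illustration of the idea, not my exact commands, and it assumes both drives have identical layouts):
losetup -fP dead-drive.img                # exposes the image's partitions as /dev/loop0p1, /dev/loop0p2, ...
sgdisk -R=/dev/nvme1n1 /dev/loop0         # replicate the GPT: same GUIDs, offsets and sizes
dd if=/dev/loop0p1 of=/dev/nvme1n1p1 bs=4M status=progress    # EFI partition
dd if=/dev/loop0p2 of=/dev/nvme1n1p2 bs=4M status=progress    # boot partition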
Now my boot process (initramfs) looks like:
md0 is built (degraded though)
crypt container has been decrypted
BUT at this point the system can’t boot, because the “vg0” group it expects has been renamed to “vg01”!
There are no conflicts; I don’t have any other LVM devices/groups, so “vg01” is the only LVM group present now.
“lvm vgchange -ay” successfully activates all the volumes, and they are then accessible from the emergency console. But I can’t boot even if I rename “vg01” back to “vg0” (it’s too late at this stage).
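For reference, the VG name reported by the on-disk metadata can be listed from the dracut emergency shell with the bare lvm binary, something like:
lvm pvs -o pv_name,vg_name,vg_uuid
lvm vgs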
I don’t understand why this is happening. It should make no difference to LVM, because LVM knows nothing about the RAID state, and the boot process has already opened the crypto container!
A kernel option to force the LVM volume group name:
rd.lvm.vg=vg0
doesn’t make sense here, since the only VG that exists is “vg01”.
I’m overwhelmed by all this. In the past I had an Ubuntu system that booted fine from a degraded RAID under the same conditions.
Please help with advice on how I can boot my Qubes OS, ideally without regenerating the initramfs.
I’ve rewritten the whole drive with the earlier dd image of my primary drive, and now I see no problems with the vg0 naming.
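The restore itself was just a plain dd of the image back onto the drive; the image name and device below are illustrative:
dd if=primary-drive.img of=/dev/nvme0n1 bs=4M status=progress conv=fsync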
However, after loading, the system disconnects the display completely, and only “nomodeset” helps (I used it just to check that the system hadn’t fallen into a kernel panic). In both cases, pressing the “Power” button triggers the system to shut down.
Is it normal behavior that Qubes OS doesn’t work on a degraded RAID-1?
Who is to blame for that: Qubes OS, md-tools, LVM, or something else?
Three issues occurred at the same time, forcing me to spend 10+ hours fixing them:
1. The buggy “nouveau” driver decided it was time to fully disable both displays. Fixed after disconnecting one of them. For some reason I hadn’t seen this issue before my NVMe died, so without the nomodeset flag I saw nothing on the screen after loading.
2. The VG renaming. When loading, I needed to add this parameter to the kernel:
systemd.unit=rescue.target
then, in the emergency console, rename the VG back and continue booting.
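Something like this should do it (written from memory; whether you need the leading “lvm” prefix depends on whether you end up in the dracut emergency shell or the systemd rescue shell):
vgrename vg01 vg0
vgchange -ay vg0
systemctl default    # or simply exit the rescue shell to continue booting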
3. When the system starts, I see the login dialog, but the keyboard is completely dead. That’s why, before fixing (1), I thought the PC had halted: the screen was turned off and the keyboard didn’t react even to a NumLock key press.
Investigation led me to the fact that “sys-usb” hadn’t started.
Then I booted with the kernel argument:
qubes.skip_auto_start=1
and the keyboard was kept in Dom0.
I inspected the “sys-usb” qube’s settings and noticed a strange device in the “Devices” tab, in the “Devices always connected to this qube” field:
08:00.3 Unknown device (unknown)
I removed this device and now Qubes OS loads fine.
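For the record, the same thing can apparently be done from a dom0 terminal with qvm-pci instead of the GUI (the BDF is the one from my case, and I haven’t double-checked the exact syntax):
qvm-pci                                  # list PCI devices and their assignments
qvm-pci detach sys-usb dom0:08_00.3      # drop the stale “always connected” device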
So the reason was that I had simply disconnected the USB front panel of my PC case, and this minor hardware change broke the whole OS!
This is a critical bug in Qubes OS, and I’ll definitely post an issue on GitHub!
My proposal is that “sys-usb” should start even when optional USB devices have been removed, AND it should return the chipset USB controller (with the keyboard) to Dom0 if it is unable to start for any reason. This is a blocker.
Hi. The software used for the forum (Discourse) has a Trust Level concept. New users with Trust Level 0 (Basic) are not able to edit their own posts after 24 hours. More information on the Discourse trust levels and how higher trust levels are granted here. If you need to edit the topic title, please let moderators know and specify the new title in the reply.
Thanks for the very complete analysis. I can never remember how to get the rescue console!
Did you find any reason for the renaming of the VG?
(I once tried and failed to set up an installation on a degraded array like you describe. I never imagined a missing disk would have any effect on device names. Maybe that is the explanation!)
(If you indicate your preferred title, then maybe someone here can change it - even me, maybe)
Well, I’m not an expert in LVM, but it looks like it was just named “vg01” in the LVM metadata.
I don’t know why, on RAID-1 disks, one mirror’s LVM has the name “vg0” and the other “vg01”. It’s weird; it could mean the RAID-1 disks don’t contain the same data.
Probably it was renamed during the first load… but that still doesn’t explain why, after I renamed it back, it didn’t get renamed again.
OK, the first load from a degraded RAID may be special, and the OS may refuse to load at all, indicating that one disk is missing. But why this would affect LVM, which is inside the RAID and read-only on the first load, is a puzzle.
I will just write down this case in my personal QubesOS/Linux FAQ to be ready for similar incidents in the future, so I have no time gaps in my work when (not if, because disks die from time to time) it happens again.
This week I upgraded my PC and found that Qubes OS wouldn’t start again. It wasn’t a surprise to me, but the affected qube was different.
This time I had to remove a device (probably the old network card) from the sys-net qube:
05:00.0 Unknown device (unknown)
I loaded with the qubes.skip_auto_start=1 option passed to the kernel.
The result was the same: Qubes loaded fine after this simple action.
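Same recipe as with sys-usb, just a different qube and a different BDF (again, the exact qvm-pci syntax is from memory):
qvm-pci detach sys-net dom0:05_00.0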
So the issue is not sys-usb-specific; it affects any qube that tries to use removed hardware.
I can understand my own qube not starting (I have a custom storage VM that stopped autostarting because it also used the USB ports of the old motherboard), but it’s a lack of error handling when system qubes can’t continue without a device that isn’t even required.
Also, another question arises: should I manually pass through new hardware (say, a new network card) to the sys-net qube?
There might have been a reordering of PCIe devices. This is a known bug:
Glad that you already found the solution.
I believe the answer is yes. The reason is simple: having new PCIe hardware automatically passed to a ServiceVM (based on its class) could be a security hazard. And in many cases, the user might want an individual ServiceVM for the new hardware for better control.
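A minimal sketch of what that manual passthrough could look like from dom0, assuming the new card shows up at the address mentioned above (check with qvm-pci first):
qvm-pci                                            # find the new card's BDF
qvm-pci attach --persistent sys-net dom0:05_00.0   # make it an always-connected device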