Dom0 crashes and abrupt logouts traced to oom-killer

ddevz · October 19, 2023, 5:31pm

Since the beginning (Qubes 4.0), I have had occasional system crashes that I thought were a graphics card hardware problem.

Eventually it changed from a full lockup to an “abrupt logout” (while I was working on the computer, i.e. not from the screensaver/lockscreen).

I started suspecting dom0 memory, so I increased it multiple times; each time it took longer before the system crashed or did an “abrupt logout”. dom0 is now up to 16 Gigs with the following GRUB line:

GRUB_CMDLINE_XEN_DEFAULT="console=none dom0_mem=min:16024M dom0_mem=max:16096M ucode=scan smt=off gnttab_max_frames=4048 gnttab_max_maptrack_frames=8096"

The last time I got an abrupt logout, I found things in dmesg. Here are the highlights:

[Tue Oct 17 13:37:04 2023] Tasks state (memory values in pages):
[Tue Oct 17 13:37:04 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[Tue Oct 17 13:37:04 2023] [   9706]     0  9706  1730338    31992  4812800        0             0 Xorg
...
[Tue Oct 17 13:37:04 2023] [  11459]  1000 11459  2133108     9083   458752        0             0 pulseaudio
...

There were 31 pacat-simple-vc processes running. Two examples:

[Tue Oct 17 13:37:04 2023] [  11767]  1000 11767    85475      942    81920        0             0 pacat-simple-vc
[Tue Oct 17 13:37:04 2023] [  11852]  1000 11852    85475      745    81920        0             0 pacat-simple-vc

Finally getting to:

[Tue Oct 17 13:37:04 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/lightdm.service,task=Xorg,pid=9706,uid=0
[Tue Oct 17 13:37:04 2023] Out of memory: Killed process 9706 (Xorg) total-vm:6921352kB, anon-rss:101324kB, file-rss:248kB, shmem-rss:26312kB, UID:0 pgtables:4700kB oom_score_adj:0
[Tue Oct 17 13:37:56 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Tue Oct 17 13:37:56 2023] {1}[Hardware Error]: It has been corrected by h/w and requires no further action

Killing Xorg would obviously cause an abrupt logout. (Note: the hardware error appeared 52 seconds later, so it was probably somehow caused by killing Xorg suddenly.)

I’m now trying to figure out how much memory Xorg was taking up when it was killed. Unfortunately, when I try searching for what those columns mean, and then try to use that information, the results never make sense.

I’m assuming that almost all of the 16 gigs allocated to dom0 is being consumed when the oom-killer kicks in.

One thing I heard was “multiply the numbers by 4096”, which means for Xorg the total_vm was 6.7 gigs, and the rss was 249 megs. 6.7 gigs sounds reasonable until you add up the total_vm of all the processes, which gives 59.5 Gigabytes, instead of 16.
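The “multiply by 4096” rule can be checked with a short sketch that parses the task rows quoted above. It assumes 4 KiB pages (the usual x86-64 default) and that `total_vm` and `rss` are page counts, as the “memory values in pages” header states; the function names are made up for illustration. Under those assumptions Xorg's `total_vm` comes out at about 6.6 GiB, matching `total-vm:6921352kB` in the kill line, while its `rss` comes out at about 125 MiB, close to the anon-rss + file-rss + shmem-rss sum (127884 kB), so the 249 meg figure looks like a doubling slip.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# One row of the "Tasks state" table, after the dmesg timestamp:
# [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S.*?)\s*$"
)

# Lines copied from the dmesg excerpt in this thread.
SAMPLE = """\
[Tue Oct 17 13:37:04 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Oct 17 13:37:04 2023] [   9706]     0  9706  1730338    31992  4812800        0             0 Xorg
[Tue Oct 17 13:37:04 2023] [  11459]  1000 11459  2133108     9083   458752        0             0 pulseaudio
"""


def pages_to_kib(pages):
    """Convert a page count to KiB under the PAGE_SIZE assumption."""
    return pages * PAGE_SIZE // 1024


def parse_task_rows(dmesg_text):
    """Return one dict per task row; header and unrelated lines are skipped."""
    rows = []
    for line in dmesg_text.splitlines():
        m = ROW.search(line)
        if m is None:
            continue
        rows.append({
            "pid": int(m.group(1)),
            "total_vm": int(m.group(4)),  # pages
            "rss": int(m.group(5)),       # pages
            "name": m.group(9),
        })
    return rows


if __name__ == "__main__":
    for row in parse_task_rows(SAMPLE):
        vm_gib = pages_to_kib(row["total_vm"]) / 1024**2
        rss_mib = pages_to_kib(row["rss"]) / 1024
        print(f"{row['name']:<12} total_vm={vm_gib:.2f} GiB  rss={rss_mib:.1f} MiB")
```

Feeding the full table from `dmesg` into `parse_task_rows` and summing the `rss` values gives the per-process resident total without adding the columns by hand.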

So then I think maybe total_vm isn’t the real number (the vm probably stands for “virtual memory”) and rss is the real measure of how much memory is being used, but it’s only 249 megs, and the total rss for all processes is only 0.9 Gigs, far short of the 16 Gigs.
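A gap like this between summed rss and total memory is possible because per-process rss only covers pages mapped by userspace processes. Kernel-side usage such as slab caches, page tables and tmpfs/shared memory is reported in `/proc/meminfo` instead. The sketch below prints a few of those fields; the field names are standard `/proc/meminfo` entries, but which of them (if any) accounts for the missing memory here is not known from this thread.

```python
import os

# Fields worth comparing against the summed per-process rss.
FIELDS = ("MemTotal", "MemFree", "MemAvailable", "Shmem",
          "Slab", "SUnreclaim", "PageTables")


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Keep only the fields of interest, in a fixed order."""
    return {name: info[name] for name in FIELDS if name in info}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for name, kib in summarize(parse_meminfo(f.read())).items():
            print(f"{name:>14}: {kib / 1024:10.1f} MiB")
```

If `SUnreclaim` or `Shmem` is large while the process list looks small, the memory is being held outside ordinary process rss and the per-process columns will not show it.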

Can anyone tell me how to determine what process is consuming so much memory that the oom-killer is kicking in?

Bearillo · October 19, 2023, 5:47pm

For example, run top in a dom0 console. If you want something more GUI-based, you can set up e.g. “CPU-Graph” or “System Load Monitor” in the “Panel” menu; the latter allows live tracking of dom0 memory. Clicking on either in the Panel then launches the dom0 Task Manager, which shows all processes and their resource usage.

ddevz · October 19, 2023, 5:59pm

Unfortunately, since Xorg was already killed, a new Xorg was created and now has a much smaller memory footprint. Also, killing Xorg seemed to cause pulseaudio to be terminated and a new process to be spawned (probably after I logged back in).

Currently top says the biggest memory users are:
xfdesktop at 1.1%
Xorg at 1.1%
xfwm4 at 0.7%

Bearillo · October 19, 2023, 6:02pm

Well, the idea was that you monitor memory continuously, either with top or with the mentioned Panel widgets, and when you see usage getting high you should then be able to check which processes are the problem.

ddevz · October 19, 2023, 6:04pm

Yes. That’s my plan for next time. I have a cron script that uses “notify-send” to pop up a message when dom0 memory is low, in hopes of doing exactly that.
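The actual cron script is not shown in the thread; a minimal sketch of the idea might look like the following. The 1024 MiB threshold is an arbitrary placeholder, and the check uses the `MemAvailable` field of `/proc/meminfo`. When memory is low it also prints the processes sorted by rss, so that a snapshot survives even if Xorg is killed before anyone can look. Note that `notify-send` started from cron usually needs the session's `DISPLAY` and `DBUS_SESSION_BUS_ADDRESS` in its environment to reach the desktop.

```python
import os
import shutil
import subprocess

THRESHOLD_MIB = 1024  # hypothetical threshold, tune to taste


def mem_available_kib(meminfo_text):
    """Return MemAvailable in KiB, or None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def is_low(meminfo_text, threshold_mib=THRESHOLD_MIB):
    """True when MemAvailable is present and below the threshold."""
    avail = mem_available_kib(meminfo_text)
    return avail is not None and avail < threshold_mib * 1024


def main():
    with open("/proc/meminfo") as f:
        text = f.read()
    if not is_low(text):
        return
    avail_mib = mem_available_kib(text) // 1024
    if shutil.which("notify-send"):
        subprocess.run(
            ["notify-send", "dom0 memory low", f"MemAvailable: {avail_mib} MiB"],
            check=False,
        )
    if shutil.which("ps"):
        # Snapshot of processes, largest resident set first.
        snap = subprocess.run(
            ["ps", "-eo", "pid,rss,vsz,comm", "--sort=-rss"],
            capture_output=True, text=True, check=False,
        )
        print(snap.stdout)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    main()
```

Redirecting the script's output to a log file from the crontab entry keeps the snapshots available after a logout.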

ddevz · October 19, 2023, 6:06pm

I just checked out the GUI “task manager” that you mentioned, and it looks helpful. Thanks.

renehoj · October 19, 2023, 6:10pm

You can try sudo pmap -x PID to inspect how the Xorg process uses its memory.

ddevz · October 19, 2023, 6:18pm

Interesting… pmap shows the total Kbytes of Xorg currently at 5.5G (out of 16 Gigs), whereas top shows the memory of Xorg at 1.1%.
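The two numbers measure different things: the Kbytes total in `pmap -x` is the size of all mappings (virtual address space), while top's %MEM is based on resident memory as a share of physical RAM. A sketch of the same comparison using the `VmSize` and `VmRSS` fields of `/proc/PID/status` and `MemTotal` from `/proc/meminfo` follows; the helper names are illustrative, and the demo at the bottom inspects the script's own process rather than Xorg.

```python
import os


def read_kib_fields(text, wanted):
    """Pick 'Name:   1234 kB' fields out of /proc-style text as KiB integers."""
    found = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and name in wanted and parts and parts[0].isdigit():
            found[name] = int(parts[0])
    return found


def virtual_vs_resident(status_text, meminfo_text):
    """Compare mapped (virtual) size with resident size for one process."""
    proc = read_kib_fields(status_text, {"VmSize", "VmRSS"})
    total = read_kib_fields(meminfo_text, {"MemTotal"})["MemTotal"]
    return {
        "virtual_gib": proc["VmSize"] / 1024**2,
        "resident_mib": proc["VmRSS"] / 1024,
        "resident_percent": 100.0 * proc["VmRSS"] / total,
    }


def report(pid):
    """Read the two /proc files for a pid and return the comparison."""
    with open(f"/proc/{pid}/status") as f:
        status = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    return virtual_vs_resident(status, meminfo)


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Demo on this script's own pid; pass the Xorg pid to inspect Xorg.
    print(report(os.getpid()))
```

A large virtual size next to a small resident percentage is normal for Xorg and by itself does not show that Xorg is the process exhausting dom0 memory.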

balko · October 23, 2023, 6:28pm

Maybe it is somehow related to the generation of video previews that are not properly cleared?
Can you turn off the compositor in the DE and see whether it makes a difference?