Since the beginning (Qubes 4.0), I have had occasional system crashes that I thought were caused by a graphics card hardware problem.
Eventually the symptom changed from a full lockup to an “abrupt logout” while I was actively working on the computer (i.e. not while the screensaver or lock screen was active).
I started suspecting dom0 memory, so I increased dom0 memory several times. Each time, it took longer before the system crashed or did an “abrupt logout”. dom0 is now up to 16 GB, with the following GRUB line:
GRUB_CMDLINE_XEN_DEFAULT="console=none dom0_mem=min:16024M dom0_mem=max:16096M ucode=scan smt=off gnttab_max_frames=4048 gnttab_max_maptrack_frames=8096"
The last time I got an abrupt logout, I found the following in dmesg. Here are the highlights:
[Tue Oct 17 13:37:04 2023] Tasks state (memory values in pages):
[Tue Oct 17 13:37:04 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[Tue Oct 17 13:37:04 2023] [ 9706] 0 9706 1730338 31992 4812800 0 0 Xorg
...
[Tue Oct 17 13:37:04 2023] [ 11459] 1000 11459 2133108 9083 458752 0 0 pulseaudio
...
There were 31 pacat-simple-vc processes running. Two examples:
[Tue Oct 17 13:37:04 2023] [ 11767] 1000 11767 85475 942 81920 0 0 pacat-simple-vc
[Tue Oct 17 13:37:04 2023] [ 11852] 1000 11852 85475 745 81920 0 0 pacat-simple-vc
Finally getting to:
[Tue Oct 17 13:37:04 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/lightdm.service,task=Xorg,pid=9706,uid=0
[Tue Oct 17 13:37:04 2023] Out of memory: Killed process 9706 (Xorg) total-vm:6921352kB, anon-rss:101324kB, file-rss:248kB, shmem-rss:26312kB, UID:0 pgtables:4700kB oom_score_adj:0
[Tue Oct 17 13:37:56 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Tue Oct 17 13:37:56 2023] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Killing Xorg would obviously cause an abrupt logout. (Note: the hardware error was logged 52 seconds later, so it was probably somehow caused by Xorg being killed suddenly.)
I’m now trying to figure out how much memory Xorg was using when it was killed. Unfortunately, when I search for what those columns mean and then try to apply that information, the results never make sense.
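For reference, here is a minimal Python sketch that pulls the kB figures out of the “Killed process” line quoted above. It assumes that Xorg's resident memory is the sum of the anon-rss, file-rss and shmem-rss fields, and that “kB” here means KiB; the function and variable names are made up for illustration.

```python
import re

# The "Killed process" line copied from the dmesg output above.
KILL_LINE = (
    "Out of memory: Killed process 9706 (Xorg) total-vm:6921352kB, "
    "anon-rss:101324kB, file-rss:248kB, shmem-rss:26312kB, UID:0 "
    "pgtables:4700kB oom_score_adj:0"
)


def kill_line_fields(line):
    """Return every 'name:<number>kB' field in the line as {name: value in kB}."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


fields = kill_line_fields(KILL_LINE)

# Assumption: resident memory = anonymous + file-backed + shared-memory pages.
resident_kb = fields["anon-rss"] + fields["file-rss"] + fields["shmem-rss"]

print(f"Xorg resident: {resident_kb} kB = {resident_kb / 1024:.1f} MiB")
print(f"Xorg virtual:  {fields['total-vm']} kB = {fields['total-vm'] / 1024**2:.1f} GiB")
```

If that assumption holds, the resident figure is what Xorg actually had in RAM when it was killed, and the total-vm figure is only the size of its address space.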
I’m assuming that almost all of the 16 GB allocated to dom0 is consumed by the time the oom-killer kicks in.
One thing I heard was “multiply the numbers by 4096”, which means that for Xorg the total_vm was about 6.6 GiB and the rss was about 125 MiB. 6.6 GiB sounds reasonable until you add up the total_vm of all the processes, which gives 59.5 GB instead of 16.
So then I think maybe total_vm isn’t the real number (the “vm” probably stands for “virtual memory”) and rss is the real amount of memory being used. But that is only about 125 MiB for Xorg, and the total rss for all processes is only 0.9 GB, far short of the 16 GB.
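To make the “multiply by 4096” arithmetic reproducible, here is a rough Python sketch that converts the task-table rows quoted above from pages to bytes. It assumes 4 KiB pages and takes the header at its word that total_vm and rss are in pages; pgtables_bytes is already in bytes according to its column name. The helper names are hypothetical, and only the four rows shown in this post are included, so the sum is not the full-table total.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Task-table rows copied from the dmesg excerpt above, timestamps stripped.
ROWS = """\
[ 9706] 0 9706 1730338 31992 4812800 0 0 Xorg
[ 11459] 1000 11459 2133108 9083 458752 0 0 pulseaudio
[ 11767] 1000 11767 85475 942 81920 0 0 pacat-simple-vc
[ 11852] 1000 11852 85475 745 81920 0 0 pacat-simple-vc
"""


def parse_row(row):
    """Parse one row; columns after '[ pid ]' follow the header order."""
    cols = row.split("]", 1)[1].split()
    # uid, tgid, total_vm, rss, pgtables_bytes, swapents, oom_score_adj
    _uid, tgid, total_vm, rss, _pgtables, _swapents, _adj = map(int, cols[:7])
    return {
        "pid": tgid,
        "name": " ".join(cols[7:]),
        "total_vm_bytes": total_vm * PAGE_SIZE,
        "rss_bytes": rss * PAGE_SIZE,
    }


tasks = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Largest resident set first.
for t in sorted(tasks, key=lambda t: t["rss_bytes"], reverse=True):
    print(f"{t['pid']:>6} {t['name']:<16} "
          f"rss={t['rss_bytes'] / 1024**2:8.1f} MiB  "
          f"total_vm={t['total_vm_bytes'] / 1024**3:6.2f} GiB")

total_rss = sum(t["rss_bytes"] for t in tasks)
print(f"sum of rss for these rows: {total_rss / 1024**2:.1f} MiB")
```

Feeding the complete task table into the same loop would give the per-process rss ranking and the overall rss total for the moment the oom-killer ran.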
Can anyone tell me how to determine which process is consuming so much memory that the oom-killer is kicking in?