Notifications when dom0 or Xen memory is low

Split from the Quick Quality-of-Life Improvements thread


Putting this in cron to run every 5 min in dom0 can warn you before the system grinds to a halt because of low dom0 memory. (Adjust the “4” to whatever you want your memory threshold to be in gigs, and adjust the expire time (in milliseconds).)

FREE_MEMORY=$(free -g | awk '/^Mem/ {print $7}')
if (( FREE_MEMORY < 4 )); then
    notify-send --expire-time=360000 --urgency=critical 'RUNNING OUT OF DOM0 MEMORY!' "DOM0 memory is down to $FREE_MEMORY Gigs... DO SOMETHING! :)"
fi
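For reference, a crontab entry along these lines runs the check every 5 minutes (the thread later names the script “/bin/dom0-Memory-Notification”; adjust the path to wherever you saved yours):

```
*/5 * * * * /bin/dom0-Memory-Notification
```

One caveat: notify-send launched from cron typically needs the desktop session’s DBUS_SESSION_BUS_ADDRESS (and DISPLAY) exported to actually reach your screen; how to do that varies by setup.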

Similarly, putting this in cron to run every 5 min in dom0 can warn you before the system runs out of Xen memory for new qubes. (Adjust the “8” to whatever you want your memory threshold to be in gigs, and adjust the expire time (in milliseconds).) (Note: this situation was a bigger problem in Qubes 4.1 than in Qubes 4.2.)

FREE_MEMORY=$(( $(xl info | grep free_memory | sed 's/^.*:\([0-9]*\)/\1/') / 1000 ))
if (( FREE_MEMORY < 8 )); then
    notify-send --expire-time=360000 --urgency=critical 'RUNNING OUT OF XEN MEMORY!' "Xen memory is down to $FREE_MEMORY Gigs... Kill some VMs!"
fi

I added your tip to the top; please check for technical accuracy

Quite misleading too, unfortunately. Both values drop to ridiculously low amounts after long uptime, and the system still has caches to free when it needs memory. I have 393 megs free as reported by xl info and 1 gig free in dom0, and the system runs smoothly for weeks.


How many qubes do you run at a time? (and how often do you shutdown/launch new ones?)

My system used to suddenly slow down to the point of requiring a hard power-off. I found it was a dom0 memory problem. My dom0 memory will drop like crazy, but I run a lot of qubes (I checked, and at this moment I have 26 qubes running). I believe the issue is the compositor storing all the redraw information. To compensate, I changed the trigger from 8 gigs in dom0 (which I use on my system) to 4 gigs when posting it. But maybe I missed the mark by an order of magnitude :slight_smile:

If you are stable at 1000 megs in dom0, we could change the free -g part to free -m and change the 4 to 500. The objective is just to keep it from reaching 0. That might be more applicable to the average user.
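A minimal sketch of that megabyte-granularity variant (THRESHOLD_MB is a name I made up for readability; 500 matches the value discussed above):

```shell
#!/bin/bash
# Sketch: dom0 low-memory check at megabyte granularity.
# THRESHOLD_MB is an assumed variable name; tune it to your own margin.
THRESHOLD_MB=500

# Column 7 of `free -m` is the "available" figure, in megabytes.
FREE_MEMORY=$(free -m | awk '/^Mem/ {print $7}')

if (( FREE_MEMORY < THRESHOLD_MB )); then
    notify-send --expire-time=360000 --urgency=critical \
        'RUNNING OUT OF DOM0 MEMORY!' "Remaining: $FREE_MEMORY Megs"
fi
```

Same structure as the gig-based script, just finer resolution; the notification only fires while the value is under the threshold, so set the threshold to suit your own workload.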

Are you running 4.1 or 4.2? It’s critical in 4.1 not to let the xl memory drop to 0, because the memory balloon doesn’t seem to actually work in 4.1, and it will cause a newly launched VM to just be killed off, with possibly other undesirable outcomes. The memory balloon does seem to actually work in 4.2, so avoiding 0 memory in 4.2 is not as critical. While I still hesitate to leave my system at 0 gigs of xl memory for more than a minute or two, it’s possible it’s actually fine to leave it there in 4.2. In that case we would need to come up with a way to determine the free memory available to the memory balloon inside all of the running qubes, sum it together, and add it to the xl free memory.

Due to the recent discussion, in “/bin/dom0-Memory-Notification”, can you change
free -g to free -m
"Remaining: $FREE_MEMORY " to "Remaining: $FREE_MEMORY Megs"?

Also, maybe add something to emphasize that it just warns when under the threshold, and that they need to set the threshold to meet their needs? (I’m fine if you want to turn the 500 into a named variable.)

Also, for /bin/Xen-Memory-Notification can you change
"Remaining: $FREE_MEMORY" to "Remaining: $FREE_MEMORY Gigs"?

I run about 10-12 desktop qubes and 5-7 service qubes, starting and shutting down often.
It is 4.2, 342M free now in xl info, uptime is 52 days and the system is perfectly healthy.

I changed your tip as requested; please check

Also, since the top post is a Wiki post, you are able to edit it yourself.

Now that I’ve split the post, please ensure the top post and your tip both reflect the latest accepted method.

looks good

If you are running 4.2, then try this for xen memory:

FREE_MEMORY=$(( $(xl info | grep free_memory | sed 's/^.*:\([0-9]*\)/\1/') / 1000 ))
POSSIBLY_RECLAIMABLE_MEMORY=$(( $(xl vm-list | tail -n +3 | awk '{print $3}' | xargs -I {} qvm-run --pass-io {} "free -m | grep '^Mem:'" | awk '{print $7}' | paste -sd+ | bc) / 1000 ))
if (( FREE_MEMORY + POSSIBLY_RECLAIMABLE_MEMORY < 8 )); then
    notify-send --expire-time=360000 --urgency=critical 'RUNNING OUT OF XEN MEMORY!' "Xen memory is down to $FREE_MEMORY Gigs... with $POSSIBLY_RECLAIMABLE_MEMORY Gigs in possibly reclaimable memory... Kill some VMs!"
    # setting expire-time to 6 min (cron check is every 5 min) (360000 ms = 6 min)
fi

It includes what I believe would be the reclaimable memory in running qubes, so it should be good for 4.2. Of course, it’s more complicated than the old version, meaning it’s harder to understand and audit, and thus people should (would?) be more reluctant to use it. Also, it’s substantially more processing on dom0 now.


I’ve done substantially more aggressive testing with 4.2, to where I overcommit the memory now and leave the “free xen memory” at 0 during standard operation.

Recently I got a “not enough memory” error when trying to start a VM. I checked, and there was more than 8 gigs available when adding the free memory of each qube to the “free xen memory”.

There must be something else going on. Like maybe it reserves a certain amount of memory per qube, that it refuses to reclaim, even though the memory is not in use, or something like that. If anyone has ideas, please speak up.

Obviously I could just change the script to trigger at 9 gigs instead of 8, but I would prefer to figure out what’s really going on and come up with a better way of estimating.

Free doesn’t mean the page can be reallocated; the entire page needs to be untouched for Xen to be able to assign it to a new domU.

Interesting! So as time goes on, my estimation system will get worse and worse.

it looks like there are these commands/parameters:

echo 1 > /proc/sys/vm/compact_memory
echo 1 > /proc/sys/vm/compact_proactiveness

that can de-fragment memory (plus a related “extfrag_threshold” tunable). Note that compact_memory is a one-shot trigger, while compact_proactiveness is a 0–100 setting where higher values mean more aggressive background compaction.
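Before poking those knobs, it can help to see how fragmented memory actually is. A sketch (dom0-side, no special tools assumed):

```shell
#!/bin/bash
# Sketch: inspect external fragmentation via /proc/buddyinfo.
# Each column is the count of free blocks of order N (size 4KiB * 2^N);
# plenty of order-0 entries but few high-order ones suggests the free
# memory is fragmented and compaction might help.
cat /proc/buddyinfo

# A one-shot compaction would then be triggered (as root) with:
#   echo 1 > /proc/sys/vm/compact_memory
# Re-reading /proc/buddyinfo afterwards shows whether higher-order
# blocks were recovered.
```

Comparing the before/after column counts is a cheap way to tell whether compaction is doing anything on your system before leaving proactive compaction enabled.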

Has anyone ever tried to use these? I saw a reference to the compact feature causing kernel panics under red hat, but that was a long time ago.

For the dom0 memory, I have found that the hard-to-explain memory consumption culprit seems to be “slab memory”.

echo 2 > /proc/sys/vm/drop_caches

in dom0 can reclaim a bit of memory. (There is also an echo 3 > /proc/sys/vm/drop_caches option, but that does not seem to help any more than “echo 2” does.)

Also of note, the dom0 slab memory seems to occasionally grow at a crazy rate (like a gigabyte every 5 min or so) when running LibreOffice in a qube. I don’t know why this would be, but it seems very correlated.