Mirage firewall breaks connectivity daily

Mirage is a unikernel that can replace sys-firewall in Qubes OS. It is not shipped with Qubes, so I don’t assume support here. This will likely be something to follow up on GitHub (mirage/qubes-mirage-firewall: A Mirage firewall VM for QubesOS).

But for the sake of discussion: is anyone using Mirage having, or has anyone ever had, problems with connectivity breaking? Often I’ll leave my machine, come back, and all networking (except sys-net) is dead. I must restart sys-mirage-firewall for networking to resume. As soon as it restarts, everything picks up where it left off.

Obviously not ideal. In real-world use there should never be any interruption in connectivity.
I found the problem a lot worse and more frequent with the default 32 MB of RAM. After I upped the CPUs and RAM a bit, it happens less frequently. I’ve just upped it again to 128 MB of RAM and 4 vCPUs and will monitor before making any bug report.

AppVM logs don’t reveal much either unfortunately.

Hi, thanks for the report. I don’t observe this issue on my laptop, but unfortunately I don’t have Qubes 4.2 right now :(

When you say the AppVM logs don’t provide information, do you mean the mirage-fw logs?
FWIW there’s a pending PR that fixes an issue with the uplink which might fit your description of the symptoms. If you’re OK with that, would you mind compiling locally and trying out a fresh build?

Sure am. I’ll have another look if/when the problem reoccurs. I just recall seeing traffic type logging. No particular errors.

I compiled with docker per the instructions on GitHub, and my system is a very recent install of 4.2, so there shouldn’t be anything funky going on.
Would the pending PR only be relevant if sys-net (my only upstream from sys-mirage-firewall) loses connectivity? This issue seems unrelated to any sys-net issues. I can always open a terminal in sys-net and ping successfully when sys-mirage-firewall is in this broken state.

The current mirage-fw also doesn’t support having its netvm changed or restarted. The connectivity loss can also occur if sys-net changes the vif dedicated to its mirage-fw client (I’m unsure why this could happen).

But thinking twice, if raising the memory solves or delays the issue, it’s probably something else, related to an old memory fragmentation issue being back :(

Ah, this sounds plausible. Is there any superficial testing I could do to help diagnose this as a potential issue?

Should I expect to see mem use gradually increase under dom0 xentop for example?

No, it won’t. Memory is reserved once at startup, and the OCaml runtime asks for memory from time to time, but that won’t be visible from Xen.

If the pressure on memory (e.g. from fragmentation) gets high, the unikernel starts to call the GC more often, and if that is still not enough, it starts to drop packets (but tracing of this was removed some time ago). You might observe high CPU usage for a while due to the GC calls. And finally, at some point, it should kill itself with an “out of memory” log message.
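
To make the sequence above concrete, here is a minimal Python model of it. This is not the firewall’s actual OCaml code: the 0.5 threshold, the FakeHeap class and every function name are assumptions made up for illustration. Only the order of events (check free memory, force a GC, drop the packet if memory is still short) comes from the description above.

```python
class FakeHeap:
    """Toy heap: 'free' is the free fraction, 'garbage' is what a full GC can reclaim."""

    def __init__(self, free, garbage):
        self.free = free
        self.garbage = garbage
        self.gc_runs = 0

    def free_fraction(self):
        return self.free

    def full_gc(self):
        # A full collection returns all reclaimable garbage to the free pool.
        self.free += self.garbage
        self.garbage = 0.0
        self.gc_runs += 1


def handle_packet(heap, low_water=0.5):
    """Decide what to do with one incoming packet.

    low_water is a hypothetical threshold, not the value used by mirage-fw.
    """
    if heap.free_fraction() > low_water:
        return "forward"  # plenty of memory: no GC needed
    heap.full_gc()  # memory pressure: force a collection first
    if heap.free_fraction() > low_water:
        return "forward"  # the GC freed enough memory
    return "drop"  # still short on memory: drop the packet
```

In this toy model, a heap created with FakeHeap(free=0.3, garbage=0.1) runs a full GC on every packet and still drops it, which corresponds to the “high CPU usage, no traffic” state described above.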

The last release also no longer reports memory information to Xen, so that it stops being involved in the memory ballooning process. Maybe I got something wrong there and reporting is mandatory with 4.2?

I don’t know if it is mandatory, but since you mention memory ballooning and 4.2, have you seen this thread, @palainp?

Another thing I’ve noticed is that sys-firewall will (as expected) saturate my 1 Gbps link, but mirage tops out at ~530 Mbps.

Yes, TCP Segmentation Offload isn’t available with mirage so far. That’s something to code in the future.

You should be able to check that by disabling TSO in sys-net or sys-firewall: sudo ethtool -K eth0 tso off (you may need to adapt the interface name).

Thanks! I’ll try to upgrade my laptop soon™ to investigate further.

I think I got a “sort of” similar issue, but it was limited to DNS resolution. The next time this occurs, and if you haven’t done so already, would you mind testing something like ping in your AppVM?
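
If it helps to tell a DNS-only failure from a total loss of connectivity, here is a minimal Python sketch that could be run in the AppVM. It is only an approximation of the ping test: it opens a TCP connection instead of sending ICMP (which would need raw sockets), and the probe address 9.9.9.9 port 53 and the hostname example.org are arbitrary choices of mine, not something from this thread.

```python
import socket


def can_connect(ip, port=53, timeout=3.0):
    """Return True if a TCP connection to a literal IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(hostname):
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok, dns_ok):
    """Turn the two probe results into a short human-readable verdict."""
    if ip_ok and dns_ok:
        return "network and DNS both work"
    if ip_ok:
        return "DNS-only failure"
    if dns_ok:
        return "name resolves (possibly cached) but no IP connectivity"
    return "total connectivity failure"
```

When the problem shows up, something like print(diagnose(can_connect("9.9.9.9"), can_resolve("example.org"))) gives a quick verdict to include in a report.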

It was definitely a total failure in connectivity when I was having the issues. I’ve gone back to a minimal net/firewall in the interim, hoping it’ll get resolved eventually.

Hi @xmpriv, sorry for the very long delay here.

If you’re still keen on testing, would you mind setting the mirage-fw kernelopts to '-l debug' and waiting for the network to be lost? (On my end I have to start some traffic with Firefox; ping alone doesn’t break the network flow.)
You’ll get a lot of debug information, but right after the network was killed I observed some [DEBUG] [uplink] received ipv4 packet from x.x.x.x on uplink lines without the intermediate nat-rewrite information. To me, this means we hit the Memory_pressure limit [1] and the fw just drops the packets… This is consistent with the observation that increasing the memory dedicated to the fw helps a bit (but I don’t understand why I couldn’t reproduce that earlier).
It would be great if you could test without this Memory_pressure check, using the following branch [2] (build with either podman or docker). The test is to use mirage-fw as usual and see whether the newer OCaml and mirage/mirage-fw can work normally and manage the memory heap by themselves, without the need for manual garbage collection. If not, an OOM log message will appear. I tried it for a couple of hours and it seems to work correctly, but it will surely need deeper and longer testing.

[1] qubes-mirage-firewall/firewall.ml at b318fabd43c66eb2e83ec566da662525c39e229c · mirage/qubes-mirage-firewall · GitHub
[2] GitHub - mirage/qubes-mirage-firewall at test-no-memory-pressure
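
For anyone collecting the debug log, here is a rough Python sketch for spotting the pattern described above: uplink packets that are followed by another uplink packet with no NAT line in between. The uplink wording is the one quoted in this post. The NAT marker is a placeholder guess of mine (I have not checked the exact wording of the NAT debug lines), so adjust it after looking at real output.

```python
UPLINK_MARKER = "[uplink] received ipv4 packet"  # wording quoted in the post above
NAT_MARKER = "nat"  # ASSUMPTION: crude placeholder substring for NAT rewrite lines


def unrewritten_uplink_packets(lines, uplink_marker=UPLINK_MARKER, nat_marker=NAT_MARKER):
    """Return uplink lines that were followed by another uplink line
    without any NAT line in between.

    The last uplink line of the log is never reported, because the log may
    simply have been cut before its NAT line was written.
    """
    suspicious = []
    pending = None  # most recent uplink line still waiting for a NAT line
    for raw in lines:
        line = raw.rstrip("\n")
        if uplink_marker in line:
            if pending is not None:
                suspicious.append(pending)
            pending = line
        elif nat_marker in line.lower():
            pending = None  # the pending packet was rewritten
    return suspicious
```

The function accepts any iterable of lines, so an open log file can be passed directly. A long run of reported lines right before the network dies would match the “packets are being dropped” hypothesis; an empty result would point elsewhere.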

EDIT: So it still runs OOM after some time. Don’t bother testing it, except to confirm the possible root cause of the network shutdown (I mean, you can try to reproduce the “drop packet” behaviour to be sure where the issue is, but so far I have no solution :( )

Kindest apologies for the delay. Kept a mental note to come back when I finally had some spare time to test what I saw you request via my reply notification email.

I now, however, see your edited remark about not bothering. That’s a shame, alas. But much appreciation for taking the time to try to make headway on it. I wish I could help there, but unfortunately I have little of value to contribute on that part.

Let me know if you make any other developments and need some help testing.

Yes, sorry for the wild edit, I often forget about people who follow the forum by email :(
In the meantime, I have not been able to reproduce the issue. But now that the dynamic uplink code is in the release and seems to work for others, maybe it’s worth trying again? (Actually, if you have DNS firewall rules, you might want to wait a bit for a new release: Mirage-firewall 0.9.0 released - #6 by neoniobium).
