Getting accelerated 3D on Qubes is probably one of its biggest pain points. Even if everyone could forward a second GPU to a single domain using PCI passthrough (leaving aside the difficulties involved), we could not do that for 2 domains simultaneously, and would still face security implications even when dealing with 2 VMs non-simultaneously.
But there is another approach that (hopefully) would allow us to overcome all of this: streaming the OpenGL calls from individual domains to a "GPU server" in sys-gui-gpu.
I played with the idea, happy that prior art existed - although trying that original codebase (written for rendering on Android, and itself based on a codebase for rendering on a Raspberry Pi, no less) requires some amount of motivation.
I'm not quite to the point of getting this to run in sys-gui-gpu, partly because I don't have one running yet, but mostly because the original code is incomplete, only works when compiled as 32bit code (and thus with 32bit apps), uses an insecure network protocol sharing and accepting pointers, and other fun stuff.
Nevertheless, I've started to poke at it as an experiment, to the point that I've been able to run es2gears - that is, a 32bit version of it, with the "GLES server" running in the same domain with Mesa software rendering. And the small window that runs at 1000fps with plain Mesa runs at an anemic 70fps when streamed. That may not sound so promising, but given the state of the stack there is quite some room for improvement, and I'm submitting this to your thoughts.
Have fun
Edit: a simple removal of apparently-arbitrary throttling already turned the original 15x slowdown into a mere 5x slowdown (around 220fps). Stay tuned for hopefully better stuff.
2 weeks after this heads-up, things have progressed nicely (given the amount of time I've been able to put into this). Progress highlights:
now testable with 64bit apps (while temporarily losing the ability to work with 32bit ones)
texture support now working
a number of fixes to the original forwarding of some APIs, plus more APIs implemented
Much remains to be done for real-life usage, but more and more glmark2 benchmarks now run (textures, shading, bump, build:use-vbo=true, pulsar, function, conditionals).
High priority short term work:
provide at least a stub for all GLES2 core functions (apps out there carelessly assume the full API is implemented, and currently happily proceed to call NULL)
get an app's graphic resources freed in the server when a client disconnects (we sometimes get fun interactions, not to mention saturating memory)
I've been postponing a status update for a few days as I kept stumbling on problems preventing me from seeing the actual progress, but here are the highlights.
near-complete rewrite of the limited UDP networking, replaced by a TCP implementation. No more limitation on the size of textures and other buffers to be uploaded to the GPU.
lots of effort put into code cleanup and internals documentation (and still lots on my todo list, not all of it listed in the TODO file)
many bugs addressed
integrated and finalized useful work from all public upstream branches on GitHub
last but not least, more coverage of both EGL and GLES2 APIs:
glmark2 can now be launched for its default set of benchmarks; the still-not-implemented functions don't prevent it from running everything it can (it does skip all tests needing an extension, 2 tests miss a few crucial APIs, and only 1 test clearly lacks something without any warning to give a clue)
Bobby Volley 2 can be launched and played (far from an AAA game, but hey, that's still a real game, and it is playable despite a few missing APIs)
Next focus:
more coverage and some benchmark numbers
finally deal with the graphic-resource isolation and freeing already mentioned (some crucial infrastructure work is already there)
finally rework the protocol to get rid of those pointers-on-the-wire (yuck)
I made a quick FPS comparison, using today's 0b6cbf7265314 revision from git master, with the glmark2 default test (windowed, 800x600), comparing:
running natively on an AMD Stoney APU
running through gl-streaming over the loopback network interface
running glmark2 on a higher-end (QubesOS) laptop, streaming to the AMD Stoney APU through gigabit ethernet
Keep in mind that this is still preliminary code, with no profiling/optimisation done yet (I even removed most upstream optimisations, which were slowing down feature progress).
And all of this, obviously, is just statistics
Potentially interesting information:
with GLS alone, most tests suffer about a 60% drop in FPS; some perform noticeably better, but one (the famous ideas scene, which does not render properly to start with) suffers a nearly 90% FPS drop
the tests that prove the most demanding in the native case usually don't suffer as much from being run through GLS, which could be a rather good sign for real-life loads
if we add a real network to the picture, most tests show a 5-10% additional loss, except:
one test using client-side data instead of VBOs (where we likely transfer the data several times; no surprise that's a serious handicap when it happens over the network)
several tests show a better score than with GLS alone, which seems to hint that in those particular cases the more powerful CPU can indeed compensate for part of the loss in GPU throughput (could those be CPU-bound benchmarks of the GPU? that would not sound good)
I'm currently working on proper support for EGL and GLES extensions, with the former pretty much working and the latter still getting details wired up. In parallel, I've found a set of examples from a GLES2 book which put the spotlight on a nasty issue (for which I pushed a wrong fix to master, stay tuned), which may turn out to be the root cause for glmark2 -d ideas not rendering everything it should. I'm still investigating this one.
I'll post a more formal "progress report" once those 2 items are fully dealt with.
It would be really great to have more real GLES2 applications making use of 3D, as almost everything out there on the desktop uses desktop GL and GLX. I have a plan to take some godot3 3D games, which mostly use GLES3, though it is quite easy to ask godot3 to use GLES2 instead. But that won't be as good as a game designed for GLES2; we'll just have very few graphical effects available…
It took some time since the last update, and accordingly this is a pretty big one:
a process is now forked on the server side for each connecting client, providing both resource isolation (with freeing on client termination) and support for several simultaneous clients
the server-side window is now created at WindowSurface-creation time, and has the expected size (the server window does not get resized afterwards, but the displayed size can usually be reduced by making the input window smaller)
support for extensions, both for EGL and GLES2, with a couple of them implemented, as required/useful for the test programs at hand
only EGL 1.0 is advertised, now that glmark2 landed a conformance fix allowing it to run
code cleanups, doc improvements, and too many bugfixes to list here
As "running weston over X11" implies, this seems to bring some Wayland support, but well… the compositor having access to GPU acceleration does not necessarily mean that Wayland apps produce GPU-accelerated streams to be composed, or that meaningful Wayland apps can run yet (eg. glmark2-es2-wayland does not yet).
What next?
It looks like a good point in this prototype to have a look at rendering the window in the GPU domain. I'll first look at running the server in the GPU domain. There are 2 complementary paths there:
the traditional GPU-in-dom0 setup, but without networking in dom0, which will require a virtio-based transport
the sys-gui-gpu setup, where the current TCP transport can possibly be used as-is (this can yield results sooner, but I'll have to get that sys-gui-gpu to work on my machine first)
With the display being done in the same domain as the server, it will be possible to look at how to use a single X11 window for input and display.
I have taken some time to pass through my own GPU to sys-gui-gpu, to be able to make use of this nice work myself.
This being slowed by build times has in turn brought me to look at how we could improve the Qubes build system. But it appears this is not a good moment to work on big changes there, so I'm refocusing on investigating just activating various acceleration options in Mock (ccache, buildchroot caching and such) to make stubdom rebuild times more acceptable.
With a bit more work and some luck I may end up having this sys-gui-gpu working (and this, being just a personal side-project, can take some time).
But once these digressions are done, I intend to look first into how to leverage the Qubes GUI protocol to avoid the double-window issue (which should make the whole thing realistic to use, even though a couple of important things will still need to be fixed before calling it a first release).
Maybe I'm missing the point here, but why don't you use vchan/qrexec-rpc instead of TCP? This would a) solve the issue of the lacking network connection to dom0 and b) implement the "Qubes way" without opening additional loopholes.
Indeed, once I have a sys-gui-gpu ready, tunneling the TCP stream through qrexec is already at the top of my TODO list (and it can already be tested today; there are examples showing how to do that, including the SplitSSH configuration). But it cannot be a long-term solution if we want to implement the GL/VK calls involving shared memory. I still have to dig into vchan to see if it can be used for that, but there may be a need to get down to grant tables, for that particular topic at least.
Just tunneling TCP will not help to get rid of the "double window" issue: while a window gets created in the GPU domain to do the real display job, we still have the X11 window created in the qube where the app runs. Fortunately, this window is in fact displayed by the GUI domain, and I'm pretty much convinced the GUI protocol can be leveraged to avoid this extra window frame.
We could argue that TCP + GUI protocol would fit your "without opening additional loopholes" criterion. Whether implementing shared memory is required for correct performance can be left for later to evaluate (and it is not in the top items of my TODO list anyway).
I think there are two different requirements, and TCP won't solve either of them.
The first is having some sort of socket connection. When available, I'd always prefer a unix socket over a TCP connection because of system resources and latency: you don't have to go up and down the network stack to get your message across. To me qrexec/vchan comes closest to a unix socket, and there is a very good reason not to have any networking in dom0 (and probably sys-gui[-gpu] too). So why even bother with TCP when you can also use a socket connection without network? Requiring network in either dom0 or sys-gui[-gpu] will be a showstopper for even considering upstream adoption and/or interest by serious Qubes users.
The other one is direct memory access and/or which protocol layer to tunnel for best performance/functionality. I totally get this point, but there is absolutely nothing that TCP buys you here over qrexec/vchan other than bi-directional establishment of communication.
Keep in mind I forked a project that was (badly) using UDP. TCP is just for the current intermediate state of the PoC, and at this stage latency is not the problem to be solved.
TCP/UDP/unix sockets buy us the ability to easily develop client and server in the same VM (though maybe we could use vchan to communicate with the local VM, something like a "loopback vchan"?), as well as to test outside of Qubes to get some comparisons (though I agree that last one is not a deal breaker).
As a first step towards Qubes GUI integration, and as a followup to previous posts, I updated usage instructions with the details for setting up a GLS server in the GPU domain.
My plan for the next step is to dig into qubes-guid so the GLS server can query it for the GUI-domain window backing the display, instead of creating a new one of its own (which currently appears as belonging to the GPU domain).
Once that is done and a proper design is validated, we'll be able to address the known issues and aim for a first release. And then, go Vulkan.
This was easier than expected: since qubes-guid sets window properties in the GUI domain pointing to the source VM and window id, I was able to just watch for them and create the GLES context on that window.
There is a quick-and-dirty PoC in the wip/qubes-window branch, but things will have to be straightened out before it is ready to merge (eg. the TCP mode can't be used any more in that branch).
I noticed that socat is in fact really slow: just using it on localhost to forward TCP traffic between the local client and server is enough to drop the es2gears framerate from ~500fps to ~60fps. That's probably enough to explain the terrible performance I get with this branch (there are 2 socat processes in the pipe, so if each causes a 9x slowdown, the numbers make sense…).
I now have a PoC for a qrexec transport (in the wip/transports branch). Performance-wise, all benches are capped by my 144Hz refresh rate, and glmark2 only gets a global mark of 143 because of its rounding (only the 2 most complex scenes went down to 143, and I believe each would have got a higher mark without the capping). So great news overall, and a strong hint to get rid of that framerate cap.
That branch is a WiP one too, and not built atop wip/qubes-window, so you can't get all the good stuff in one test yet. This branch is much closer to being mergeable than the latter, though: it already provides compile-time selection of the transport to be used, with a choice of TCP (the implementation currently in master), stdio (useful for testing, and for the GUI-domain server behind qrexec), or qrexec-client-vm (communicating through pipes with said tool to abstract the qrexec communication details). Some dynamic selection and small cleanups will be needed, but it should land in master real soon now. Only then will it be time for a proper implementation of wip/qubes-window.
Edit: advertising EGL 1.1 instead of 1.0 (even though eglSwapInterval is the only 1.1 function implemented) is sufficient to unleash the FPS in glmark2. Current framerates in this version are about 10% of the native speed (yes, -90%: quite some margin for progress, though already better than software rendering on the same machine, which performs at 1/25th of the native GPU speed).
Quick update, before really diving into the shared-memory transport: there was a huge performance hit coming from an old and easily-fixed remaining sleep-in-a-loop I had missed, and glmark2 in a qube through gl-streaming now runs barely 3 times slower than natively in dom0.
Details of the benchmark can be found here.