Progress reports: streaming OpenGL/Vulkan calls to sys-gui-gpu

Getting accelerated 3D on Qubes is probably one of the biggest pain points. Even if everyone could forward a second GPU to a single domain using PCI passthrough (without even talking about the difficulties involved), we would not be able to do that to 2 domains simultaneously, and would have to face security implications even when dealing with 2 VMs non-simultaneously.

But there is another approach that (hopefully) would make it possible to overcome all of this: streaming the OpenGL calls from individual domains to a “GPU server” in sys-gui-gpu.

I played with the idea, happy that prior art existed - although trying that original codebase (written for rendering on Android, and itself based on a codebase for rendering on a Raspberry Pi, no shit) requires some amount of motivation.

I’m not quite at the point of getting this to run in sys-gui-gpu, partly because I don’t have one running yet :wink: , but mostly because the original code is incomplete, only works when compiled as 32bit code (and thus with 32bit apps), uses an insecure network protocol sharing and accepting pointers, and other fun stuff.

Nevertheless, I’ve started to poke at it as an experiment, to the point I’ve been able to run es2gears. That is, a 32bit version of it, with the “GLES server” running in the same domain with Mesa software rendering. And the small window that runs at 1000fps with plain Mesa first ran at an anemic 70fps. That may not sound so promising, but it is now around 220fps, which is finally not so bad given the room for improvement left in the stack, and I’m submitting this to your thoughts.

Have fun :slight_smile:

Edit: a simple removal of apparently-arbitrary throttling already turned the original 15x slowdown into a mere 5x slowdown. Stay tuned for hopefully better stuff :slight_smile:
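
For reference, the slowdown factors follow directly from the frame rates quoted in this post (1000fps native, 70fps before and around 220fps after removing the throttling). A minimal sketch of the arithmetic, with helper names made up for illustration:

```python
def slowdown(native_fps: float, streamed_fps: float) -> float:
    """How many times slower the streamed run is compared to the native one."""
    return native_fps / streamed_fps


def fps_drop_percent(native_fps: float, streamed_fps: float) -> float:
    """The same comparison expressed as a percentage of FPS lost."""
    return 100.0 * (1.0 - streamed_fps / native_fps)


before = slowdown(1000, 70)    # roughly 14x, quoted as "15x" in the post
after = slowdown(1000, 220)    # roughly 4.5x, quoted as "5x" in the post
print(f"{before:.1f}x -> {after:.1f}x")
```

The same two helpers convert between the "N times slower" and "N% drop" wordings used in later reports.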


progress report

2 weeks after this heads-up, things have progressed nicely (given the amount of time I’ve been able to put into this). Progress highlights:

  • now testable with 64bit apps (while temporarily losing the ability to work with 32bit ones)
  • texture support now working
  • a number of fixes to the original forwarding of some APIs were made, and more APIs were implemented

Much remains to be done for real-life usage, but more and more glmark2 benchmarks now run (textures, shading, bump, build:use-vbo=true, pulsar, function, conditionals).

High priority short term work:

  • provide at least a stub for all GLES2 core functions (all apps out there carelessly assume the full API is implemented, and currently happily proceed to call NULL)
  • get an app’s graphic resources freed in the server when a client disconnects (we sometimes have fun interactions, not to mention saturating memory)
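
The stub idea from the first item can be sketched as a dispatch table in which every known entry point resolves to something callable. This is only an illustration in Python, not the gl-streaming code (which is C): the table contents and helper names are assumptions, and only three real GLES2 function names are listed.

```python
import logging

log = logging.getLogger("gls-stub-sketch")

# Hypothetical subset of the API that the client library really forwards.
IMPLEMENTED = {
    "glClear": lambda mask: ("forwarded", "glClear", mask),
}

# Names an application may look up; a real table would list all of GLES2 core.
GLES2_CORE = ["glClear", "glStencilMask", "glSampleCoverage"]


def make_stub(name):
    """Return a do-nothing replacement that reports the missing function."""
    def stub(*args):
        log.warning("%s is not implemented yet, call ignored", name)
        return None
    stub.__name__ = name
    return stub


def build_dispatch_table():
    # Every core name gets either the real forwarder or a logging stub,
    # so a lookup never hands a null entry back to the application.
    return {name: IMPLEMENTED.get(name) or make_stub(name) for name in GLES2_CORE}


table = build_dispatch_table()
table["glStencilMask"](0xFF)   # logs a warning instead of jumping to NULL
```

The application still misbehaves visually when it hits a stub, but it gets a log line naming the missing function rather than a crash.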

progress report

I’ve been postponing a status update for a few days as I kept stumbling on problems preventing me from seeing the actual progress, but here are the highlights.

  • near-complete rewrite of the limited UDP networking, with a TCP implementation. No more limitation on the size of textures and other buffers to be uploaded to the GPU.
  • lots of efforts into code cleanup and internals documentation (and still lots on my todo list, and not everything listed in the TODO file :wink: )
  • many bugs addressed
  • integrated and finalized useful work from all public upstream branches on github
  • last but not least, more coverage of both EGL and GLES2 APIs:
    • glmark2 can now be launched for its default set of benchmarks, the still-not-implemented functions don’t prevent it from running everything it can (it does skip all tests needing an extension, 2 tests miss a few crucial APIs, only 1 test clearly lacks something without any warning to give a clue)
    • Bobby Volley 2 can be launched and played (far from an AAA game, but hey, that’s still a real game, and it is playable despite a few missing APIs)
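
To illustrate why a stream transport lifts the size limit on uploads: with a byte stream, a message of any length can be sent behind a small length header and reassembled on the other side. A minimal sketch, assuming a 4-byte big-endian length prefix (the actual gl-streaming wire format is not described here and may differ), and using a local socket pair in place of a real TCP connection:

```python
import socket
import struct
import threading

HEADER = struct.Struct("!I")  # assumed layout: 4-byte big-endian payload length


def send_msg(sock, payload: bytes) -> None:
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, however the stream happens to be chunked."""
    chunks = []
    while n:
        chunk = sock.recv(min(n, 65536))
        if not chunk:
            raise ConnectionError("peer closed the stream mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


# A 1 MiB fake texture, far larger than a single UDP datagram could carry.
client, server = socket.socketpair()
texture = bytes(range(256)) * 4096
sender = threading.Thread(target=send_msg, args=(client, texture))
sender.start()
received = recv_msg(server)
sender.join()
client.close()
server.close()
```

With this particular header the only remaining limit is the 4 GiB that fits in the length field.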

Next focus:

  • more coverage and some benchmark numbers
  • finally deal with the graphic resources isolation and freeing already mentioned, some crucial infrastructure work is already there
  • finally rework the protocol to get rid of those pointers-on-the-wire (yuck)
  • EGL/GLES extension handling
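
One common way to get rid of pointers on the wire is to exchange opaque integer handles that the server validates before use. The sketch below shows that general technique only; it is not the planned gl-streaming protocol, and the class and method names are made up for illustration.

```python
import itertools


class HandleTable:
    """Maps small integers sent over the wire to objects that stay server-side."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count(1)   # 0 is kept free to mean "no object"

    def register(self, obj) -> int:
        handle = next(self._ids)
        self._objects[handle] = obj
        return handle

    def resolve(self, handle: int):
        # An unknown or forged handle is rejected instead of being dereferenced.
        try:
            return self._objects[handle]
        except KeyError:
            raise ValueError(f"unknown handle {handle!r}") from None

    def release(self, handle: int) -> None:
        self._objects.pop(handle, None)


contexts = HandleTable()
ctx_handle = contexts.register({"kind": "egl-context"})   # sent to the client
same_ctx = contexts.resolve(ctx_handle)                    # client sends it back

try:
    contexts.resolve(0xDEADBEEF)                           # forged value
    forged_accepted = True
except ValueError:
    forged_accepted = False
```

A nice side effect is that such a protocol no longer depends on client and server sharing the same pointer width.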

This seems like a really good idea, maybe we’ll even see a gaming qube in the future? Anyways, nice work!

performance

I made a quick FPS comparison, using today’s 0b6cbf7265314 revision from git master, with the glmark2 default test (windowed, 800x600), comparing:

  • running natively on an AMD Stoney APU
  • running through gl-streaming through the loopback network interface
  • running glmark2 on a higher-end (QubesOS) laptop, streaming to the AMD Stoney APU through gigabit ethernet

Keep in mind that this is still preliminary code, with no profiling/optimisation done yet (I even removed most upstream optimisations, which were slowing down feature progress).

And all of this, obviously, is just statistics :slight_smile:

Potentially interesting information:

  • from GLS alone, most tests suffer about 60% drop in FPS, some perform noticeably better, but one (the famous ideas scene, which does not render properly to start with) suffers from nearly 90% FPS drop
  • the tests that prove to be the most demanding in the native case usually don’t suffer as much from being used through GLS – that could be a rather good sign for real-life loads
  • if we add a real network in the picture, most tests show 5-10% additional loss, except:
    • one test using client-data instead of vbo (where we likely transfer the data several times, no surprise that’s a great handicap when that happens over the network)
    • several tests show a better score than with GLS alone, which seems to hint that in those particular cases, the more powerful CPU can indeed compensate for part of the loss in GPU throughput (could those be CPU-bound benchmarks of the GPU ? would not sound good)

This is cool. Please keep us up to date on your progress!

I’m currently working on proper support for EGL and GLES extensions, with the former pretty much working and the latter still needing some details wired up. In parallel I’ve found a set of examples from a GLES2 book which put the spotlight on a nasty issue (for which I pushed a wrong fix to master, stay tuned) which may turn out to be the root cause for glmark2 -d ideas not rendering everything it should – I’m still investigating this one.

I’ll post a more formal “progress report” once those 2 items are fully dealt with.

It would be really great to have more real GLES2 applications making use of 3D, as mostly everything out there on the Desktop is using Desktop GL and GLX. I have a plan to take some godot3 3D games, which are mostly using GLES3, though it is quite easy to ask godot3 to use GLES2 instead – but that won’t be as good as a game designed for GLES2, we’ll just have very few graphical effects available…


progress report

It took some time since last update, and accordingly that’s a pretty big one:

  • a process is forked on server side for each client connecting, providing both resource isolation (and freeing on client termination) and support for several simultaneous clients
  • server-side window is now created at WindowSurface-creation time, and has the expected size (server window does not get resized afterwards, but the displayed size can usually be reduced by making the input window smaller)
  • support for extensions, both for EGL and GLES2, with a couple of them implemented, as required/useful for the test programs at hand
  • successfully tested more apps (prboom, weston)
  • new support status page for tested apps
  • only EGL 1.0 is advertised, now that glmark2 landed a conformance fix allowing it to run
  • code cleanups, doc improvements, and too many bugfixes to list here
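
The process-per-client model from the first item can be illustrated with a small POSIX-only sketch: whatever a child allocates lives in that child alone, and is reclaimed by the OS when it exits. The names are hypothetical, and a socket pair stands in for a connection accepted on the listening socket.

```python
import os
import socket

resources = []   # stands in for the GL objects owned by one server process


def handle_client(conn) -> None:
    # Runs in the child: this allocation is invisible to the parent
    # and to the children serving other clients.
    resources.append("texture-for-this-client")
    conn.sendall(str(len(resources)).encode())
    conn.close()


def serve_one(conn) -> int:
    """Fork a child to serve conn and return the child's pid."""
    pid = os.fork()
    if pid == 0:
        try:
            handle_client(conn)
        finally:
            os._exit(0)   # child exit: the OS reclaims everything it held
    conn.close()          # the parent keeps no reference to the client socket
    return pid


replies = []
for _ in range(2):
    client_end, server_end = socket.socketpair()
    pid = serve_one(server_end)
    replies.append(client_end.recv(16).decode())
    client_end.close()
    os.waitpid(pid, 0)    # reap the child once its client is done
```

Each simulated client sees exactly one resource, its own, and the parent's list stays empty throughout.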

As “running weston over X11” implies, that seems to bring some Wayland support, but well… the compositor having access to GPU acceleration does not necessarily mean the wayland apps produce GPU-accelerated streams to be composed, or that meaningful wayland apps can run yet (eg. glmark2-es2-wayland does not yet).

what next ?

It looks like a good point in this prototype to get a look at rendering the window in the GPU domain. I’ll first have a look at running the server on the GPU domain. There are 2 complementary paths there:

  • traditional GPU-in-dom0 setup, but without networking in dom0, will require a virtio-based transport
  • sys-gui-gpu setup, where possibly the current TCP transport can be used as-is (can yield sooner results, but I’ll have to get that sys-gui-gpu to work first on my machine)

With the display being done in the same domain as the server, it will be possible to look at how to use a single X11 window for input and display.


Thank you for your work! Is there any news on the topic?

Thanks for your support :slight_smile:

I have taken some time to be able to passthrough my own GPU to sys-gui-gpu to be able to make use of this nice work myself :wink:
This was slowed down by build times, which in turn brought me to look at how we could improve the Qubes build system – but it does not appear to be a good moment to work on big changes there, so I’m refocusing on investigating just activating various acceleration options in Mock (ccache, buildchroot caching and such) to make stubdom rebuild times more acceptable.
With a bit more work and some luck I may end up having this sys-gui-gpu working (and this, as being just a personal side-project, can take some time).

But once these digressions get done, I intend to look first into how to leverage the Qubes GUI protocol to avoid the double-window issue (which should make the whole thing realistic to use, even though a couple of important things will still need to be fixed before calling it a first release).


Maybe I’m missing the point here, but why don’t you use a vchan/qrexec-rpc instead of tcp? This would a) solve the issue with lacking networking connection to dom0 and b) implement the “qubes-way” without opening additional loopholes.

Indeed, once I have a sys-gui-gpu ready, tunneling the TCP stream through qrexec is already on top of my TODO list (and can already be tested today, there are examples showing how to do that, including the SplitSSH configuration). But it cannot be a long-term solution if we want to implement the GL/VK calls involving shared memory - I still have to get into vchan to see if it can be used, but there may be a need to get down to grant tables, for that particular topic at least.

Just tunneling TCP will not help to get rid of the “double window” issue: while a window gets created in the GPU domain to do the real display job, we still have the X11 window created in the qube where the app runs. Fortunately this window is in fact displayed by the GUI domain, and I’m pretty much convinced the GUI protocol can be leveraged to avoid this extra window frame.

We could argue that TCP + GUI protocol would fit your “without opening additional loopholes” criteria. Whether it is required for correct performance to implement shared memory can be left for later to evaluate (and is not in the top items of my TODO list anyway).

I think there are two different requirements and TCP won’t solve either of them.
The first is having some sort of socket connection. When available, I’d always prefer a unix socket over a TCP connection because of system resources and latency - you don’t have to go up and down the stack to get your message across. To me qrexec/vchan comes closest to a unix socket and there is a very good reason to not have any networking in dom0 (and probably sys-gui[-gpu] too). So why even bother with TCP when you can also use a socket connection without network? Requiring network in either dom0 or sys-gui[-gpu] will be a showstopper for even considering upstream adoption and/or interest by serious Qubes users.

The other one is direct memory access and/or which protocol layer to tunnel for best performance/functionality. I totally get this point, but there is absolutely nothing that TCP buys you here over qrexec/vchan other than bi-directional establishment of communication.

Keep in mind I forked a project that was (badly) using UDP. TCP is just for the current intermediate state of PoC, and at this stage latency is not the problem to be solved :slight_smile:

TCP/UDP/Unix-sockets buys us the ability to easily develop client and server in the same VM (though maybe we could use vchan to communicate with the local VM, something like a “loopback vchan” ?), as well as testing outside of Qubes to get some comparisons (though I agree that last one is not a deal breaker)

As a first step towards Qubes GUI integration, and as a followup to previous posts, I updated usage instructions with the details for setting up a GLS server in the GPU domain.

My plan for the next step is to dig into qubes-guid so the GLS server can query it for the GUI-domain window backing the display, instead of creating a new one of its own (which currently appears as belonging to the GPU domain).

Once that is done and a proper design is validated, we’ll be able to address the known issues and aim for a first release. And then go Vulkan :wink: .


This was easier than expected: since qubes-guid sets window properties in the gui-domain that point to the source vm and window-id, I was able to just spy on them and create the GLES context on this window.

There is a quick-and-dirty PoC in the wip/qubes-window branch, but things will have to be put straight before it is ready to merge (eg. the TCP mode can’t be used any more in that branch).

I noticed that in fact, socat is really slow: just using it on localhost to forward TCP traffic between local client and server is enough to drop the es2gears framerate from ~500fps to ~60fps :scream:. That’s probably enough to explain the terrible performance I get with this branch (there are 2 socat processes in the pipe, so if each causes a 9x slowdown, the numbers make sense…).

Next steps:

  • use the qrexec pipe directly to get rid of socat
  • cleanup that messy PoC :slight_smile:

Here is the weekly update^Wteaser :slight_smile:

I now have a PoC for a qrexec transport (in the wip/transports branch). Performance-wise, all benches are capped to my 144Hz framerate, and glmark2 only gets a global mark of 143 because of its rounding (only the 2 most complex scenes went down to 143, and I believe they would each have got a higher mark without the capping). So great news overall, and a strong hint to get rid of that framerate cap :wink:

That branch is a WiP one too, and not built atop wip/qubes-window, so you can’t get all the good stuff in one test yet. This branch is much closer to being mergeable than the latter, though: it already provides compile-time selection of the transport to be used, with choices of TCP (the implementation currently in master), stdio (useful for testing, and for the GUI-domain server behind qrexec), and qrexec-client-vm (communicating by pipes with said tool to abstract qrexec communication details). Some dynamic selection and small cleanups will be needed, but it should land in master real soon now. Only after that will it be time for a proper implementation of wip/qubes-window.
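
The pipe-based transport described above can be sketched as follows: the client spawns a helper process and exchanges bytes over its stdin/stdout, so the helper can be qrexec-client-vm in a qube or anything else when testing outside Qubes. This is only an illustration in Python, not the gl-streaming implementation; the class name, the service name `gls.Stream` and the echo stand-in are assumptions.

```python
import subprocess
import sys


class PipeTransport:
    """Talks to a helper process through its stdin/stdout pipes."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def send(self, data: bytes) -> None:
        self.proc.stdin.write(data)
        self.proc.stdin.flush()

    def recv(self, n: int) -> bytes:
        return self.proc.stdout.read(n)   # blocks until n bytes or EOF

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


def qrexec_argv(target_vm: str, service: str) -> list:
    # qrexec-client-vm takes the target qube and the name of the service to call.
    return ["qrexec-client-vm", target_vm, service]


# Stand-in peer for testing outside Qubes: a child that echoes its stdin.
ECHO = [sys.executable, "-c",
        "import os\n"
        "while True:\n"
        "    d = os.read(0, 4096)\n"
        "    if not d:\n"
        "        break\n"
        "    os.write(1, d)\n"]

transport = PipeTransport(ECHO)
transport.send(b"hello")
reply = transport.recv(5)
transport.close()
```

In a qube, the same class would be given `qrexec_argv("sys-gui-gpu", "gls.Stream")`, with the service name replaced by whatever the server side actually registers.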

Edit: advertising EGL 1.1 instead of 1.0 (despite eglSwapInterval being the only part of 1.1 that is implemented) is sufficient to unleash the FPS in glmark2. Current framerates in this version are about 10% of the native speed (yes, -90%, quite some progress margin, though already better than software rendering on the same machine, which performs at 1/25th of the native GPU speed).


Latest news from gl-streaming:

  • support for stdio and qrexec transports was merged to master, selectable as runtime option
  • the glDrawElements operation got an important fix, notably allowing glmark2’s ideas bench to render properly
  • the wip/qubes-window branch was rebased on top of this, which makes it possible to test the latest and greatest in terms of QubesOS integration
  • benchmarks were refreshed on AMD Renoir and Stoney (tl;dr: slowdown by 10 on Renoir, and only by 3 on the much less capable Stoney)

Next focus includes:

  • continuing cleaning up wip/qubes-window
  • digging into the performance hit
  • taking a look at how to implement shared-memory APIs (qubes-guid seems to provide a useful example)
  • taking a look at what was done in Mesa on Vulkan serialization, to see if we can use this to avoid reinventing the wheel

Quick update, before really diving into shared-memory transport: there was a huge performance hit coming from an old and easily fixed remaining sleep-in-a-loop I had missed, and glmark2 in a qube through gl-streaming now runs barely 3 times slower than natively in dom0.
Details of the benchmark can be found here.
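
To show why a sleep-in-a-loop hurts so much with a synchronous request/reply protocol, here is a toy comparison between a consumer that naps on every iteration and one that blocks until data arrives. It is not the gl-streaming code: the 10ms nap, the queue-based setup and all names are assumptions chosen for the illustration.

```python
import queue
import threading
import time


def run(consumer, rounds: int = 20) -> float:
    """Time `rounds` synchronous request/reply exchanges with `consumer`."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=consumer, args=(requests, replies))
    worker.start()
    start = time.perf_counter()
    for i in range(rounds):
        requests.put(i)
        replies.get()            # wait for the answer, like a synchronous GL call
    elapsed = time.perf_counter() - start
    requests.put(None)           # tell the worker to stop
    worker.join()
    return elapsed


def polling_consumer(requests, replies) -> None:
    while True:
        time.sleep(0.01)         # the "sleep in a loop" pattern
        try:
            item = requests.get_nowait()
        except queue.Empty:
            continue
        if item is None:
            return
        replies.put(item)


def blocking_consumer(requests, replies) -> None:
    while True:
        item = requests.get()    # wakes up as soon as a request arrives
        if item is None:
            return
        replies.put(item)


polled = run(polling_consumer)
blocked = run(blocking_consumer)
print(f"polling: {polled * 1000:.0f} ms, blocking: {blocked * 1000:.1f} ms")
```

Every round trip through the polling consumer pays for at least one full nap, so the cost scales with the number of synchronous calls rather than with the amount of work done.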

