Streaming OpenGL/Vulkan calls to sys-gui-gpu

Getting accelerated 3D on Qubes is probably one of its biggest pain points. Even if everyone could attach a second GPU to a single domain using PCI passthrough (not even mentioning the difficulties involved), we could not do that for 2 domains simultaneously, and we would still face security implications when dealing with 2 VMs non-simultaneously.

But there is another approach that (hopefully) would allow us to overcome all of this: streaming the OpenGL calls from individual domains to a “GPU server” running in sys-gui-gpu.
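
To make the idea concrete, here is a minimal sketch of what forwarding a single GL call can look like – made-up names, not the actual gl-streaming wire format:

```c
/* Hypothetical sketch of GL call forwarding -- made-up names, not the
 * actual gl-streaming wire format.  Each GL call becomes a small
 * command packet; the client sends it to the GPU server, which decodes
 * it and replays the call against the real driver. */
#include <stdint.h>

enum cmd_opcode { CMD_GLCLEARCOLOR = 1, CMD_GLCLEAR = 2 };

struct cmd_glclearcolor {
    uint32_t opcode;     /* CMD_GLCLEARCOLOR */
    float r, g, b, a;    /* arguments, packed by value */
};

/* transport function, implemented elsewhere (TCP, vchan, ...) */
void send_to_server(const void *buf, uint32_t len);

/* client side: instead of entering the driver, serialize and send */
void glClearColor(float r, float g, float b, float a)
{
    struct cmd_glclearcolor cmd = { CMD_GLCLEARCOLOR, r, g, b, a };
    send_to_server(&cmd, sizeof cmd);
}
```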

I played with the idea, happy that prior art existed - although trying that original codebase (written for rendering on Android, and itself based on a codebase for rendering on a Raspberry Pi, no shit) requires some amount of motivation.

I’m not quite to the point of getting this to run in sys-gui-gpu (partly because I don’t have one running yet :wink: ), but mostly because the original code is incomplete, only works when compiled as 32-bit code (and thus only with 32-bit apps), uses an insecure network protocol that shares and accepts raw pointers, and other fun stuff.

Nevertheless, I’ve started to poke at it as an experiment, to the point that I’ve been able to run es2gears – that is, a 32-bit build of it, with the “GLES server” running in the same domain using Mesa software rendering. And the small window that runs at 1000fps with plain Mesa runs at an anemic 70fps. That may not sound promising, but given the state of the stack there is quite some room for improvement (it already reaches around 220fps, see the edit below), which is finally not so bad, and I’m submitting this to your thoughts.

Have fun :slight_smile:

Edit: a simple removal of apparently-arbitrary throttling already turned the original 15x slowdown into a mere 5x slowdown. Stay tuned for hopefully better stuff :slight_smile:


progress report

2 weeks after this heads-up, things have progressed nicely (given the amount of time I’ve been able to put into this). Progress highlights:

  • now testable with 64-bit apps (while temporarily losing the ability to work with 32-bit ones)
  • texture support now working
  • fixed a number of issues in the original forwarding of some APIs, and implemented forwarding for more of them

Much remains to be done for real-life usage, but more and more glmark2 benchmarks now run (textures, shading, bump, build:use-vbo=true, pulsar, function, conditionals).

High-priority short-term work:

  • provide at least a stub for all GLES2 core functions (apps out there carelessly assume the full API is implemented, and currently happily proceed to call NULL pointers) – a minimal stub sketch follows this list
  • get an app’s graphics resources freed in the server when its client disconnects (we sometimes get fun interactions between clients, not to mention memory saturation)
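
For the first item, a stub boils down to something like this (a minimal sketch; real code would likely generate these from the GLES2 headers):

```c
#include <GLES2/gl2.h>
#include <stdio.h>

/* Minimal sketch of a stub for a not-yet-forwarded GLES2 entry point:
 * warn on stderr instead of letting the app jump through a NULL
 * pointer and crash. */
GL_APICALL void GL_APIENTRY glHint(GLenum target, GLenum mode)
{
    (void)target;
    (void)mode;
    fprintf(stderr, "gl-streaming: %s not implemented yet\n", __func__);
}
```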

progress report

I’ve been postponing a status update for a few days, as I kept stumbling on problems that prevented me from seeing the actual progress, but here are the highlights.

  • near-complete rewrite of the limited UDP networking, replaced with a TCP implementation – no more limits on the size of textures and other buffers to be uploaded to the GPU (see the framing sketch after this list)
  • lots of effort put into code cleanup and internals documentation (and still lots on my todo list, and not everything is listed in the TODO file :wink: )
  • many bugs addressed
  • integrated and finalized useful work from all public upstream branches on GitHub
  • last but not least, more coverage of both EGL and GLES2 APIs:
    • glmark2 can now be launched for its default set of benchmarks; the still-not-implemented functions don’t prevent it from running everything it can (it does skip all tests needing an extension, 2 tests are missing a few crucial APIs, and only 1 test clearly lacks something without giving any warning as a clue)
    • Bobby Volley 2 can be launched and played (far from an AAA game, but hey, that’s still a real game, and it is playable despite a few missing APIs)
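
About the first point: removing the size limits mostly amounts to classic length-prefixed framing over a stream socket. A minimal sketch of the technique (not the actual gl-streaming code):

```c
#include <stdint.h>
#include <unistd.h>

/* Minimal length-prefixed framing over a stream socket -- a sketch of
 * the general technique, not the actual gl-streaming protocol.  TCP is
 * a byte stream, so each message carries its own size; buffers of any
 * length (e.g. big textures) can then be sent as one message. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;          /* error (EINTR handling omitted) */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int send_message(int fd, const void *payload, uint32_t len)
{
    /* 4-byte size header (host byte order, for this sketch)... */
    if (send_all(fd, &len, sizeof len) < 0)
        return -1;
    return send_all(fd, payload, len);   /* ...then the payload */
}
```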

Next focus:

  • more coverage and some benchmark numbers
  • finally deal with the graphics-resource isolation and freeing already mentioned; some crucial infrastructure work is already there
  • finally rework the protocol to get rid of those pointers-on-the-wire (yuck) – see the handle sketch after this list
  • EGL/GLES extension handling
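
For the pointers-on-the-wire point, the standard fix is to replace raw pointers with opaque handles that only the server can resolve; a minimal sketch (made-up names, not the actual protocol):

```c
#include <stdint.h>

/* Sketch of replacing pointers-on-the-wire with opaque handles.  The
 * client only ever sees small integer IDs; the server keeps the
 * ID -> object mapping, so a malicious client cannot make the server
 * dereference an arbitrary address. */
#define MAX_OBJECTS 1024

static void *object_table[MAX_OBJECTS];   /* server-side only */

static uint32_t register_object(void *obj)
{
    for (uint32_t h = 1; h < MAX_OBJECTS; h++)   /* 0 means "invalid" */
        if (!object_table[h]) {
            object_table[h] = obj;
            return h;                     /* this ID goes on the wire */
        }
    return 0;                             /* table full */
}

static void *lookup_object(uint32_t handle)
{
    if (handle == 0 || handle >= MAX_OBJECTS)
        return 0;                         /* reject out-of-range IDs */
    return object_table[handle];
}
```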

This seems like a really good idea, maybe we’ll even see a gaming qube in the future? Anyways, nice work!

performance

I made a quick FPS comparison, using today’s 0b6cbf7265314 revision from git master, with the glmark2 default test (windowed, 800x600), comparing:

  • running natively on an AMD Stoney APU
  • running through gl-streaming over the loopback network interface
  • running glmark2 on a higher-end (QubesOS) laptop, streaming to the AMD Stoney APU through gigabit ethernet

Keep in mind that this is still preliminary code, with no profiling/optimisation done yet (I even removed most upstream optimisations, which were slowing down feature progress).

And all of this, obviously, is just statistics :slight_smile:

Potentially interesting information:

  • with GLS alone, most tests suffer about a 60% drop in FPS; some perform noticeably better, but one (the famous ideas scene, which does not render properly to start with) suffers a nearly 90% FPS drop
  • the tests that prove the most demanding in the native case usually don’t suffer as much when run through GLS – that could be a rather good sign for real-life loads
  • if we add a real network into the picture, most tests show a 5-10% additional loss, except:
    • one test using client-side data instead of a VBO (where we likely transfer the data several times – no surprise that’s a great handicap when it happens over the network)
    • several tests show a better score than with GLS alone, which seems to hint that in those particular cases the more powerful CPU can indeed compensate for part of the loss in GPU throughput (could those be CPU-bound benchmarks of the GPU? that would not sound good)

This is cool. Please keep us up to date on your progress!

I’m currently working on proper support for EGL and GLES extensions, with the former pretty much working and the latter still getting its details wired up. In parallel, I’ve found a set of examples from a GLES2 book which put the spotlight on a nasty issue (for which I pushed a wrong fix to master, stay tuned), and which may turn out to be the root cause for glmark2 -d ideas not rendering everything it should – I’m still investigating this one.

I’ll post a more formal “progress report” once those 2 items are fully dealt with.

It would be really great to have more real GLES2 applications making use of 3D, as nearly everything out there on the desktop uses desktop GL and GLX. I have a plan to take some godot3 3D games, which mostly use GLES3, though it is quite easy to ask godot3 to use GLES2 instead (see the snippet below) – but that won’t be as good as a game designed for GLES2, we’ll just have very few graphical effects available…
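
For reference, switching a Godot 3 project to the GLES2 renderer should just be a matter of one setting in its project.godot (quoting from memory, so double-check against the Godot docs):

```
[rendering]

quality/driver/driver_name="GLES2"
```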


progress report

It has been some time since the last update, and accordingly this is a pretty big one:

  • a process is now forked on the server side for each connecting client, providing both resource isolation (with freeing on client termination) and support for several simultaneous clients – see the sketch after this list
  • server-side window is now created at WindowSurface-creation time, and has the expected size (server window does not get resized afterwards, but the displayed size can usually be reduced by making the input window smaller)
  • support for extensions, both for EGL and GLES2, with a couple of them implemented, as required/useful for the test programs at hand
  • successfully tested more apps (prboom, weston)
  • new support status page for tested apps
  • only EGL 1.0 is advertised, now that glmark2 landed a conformance fix allowing it to run
  • code cleanups, doc improvements, and too many bugfixes to list here
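
The isolation part is essentially the classic fork-per-connection pattern; a simplified sketch of the approach (not the actual server code):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <signal.h>
#include <unistd.h>

void handle_client(int fd);   /* per-client GL command loop, elsewhere */

/* Classic fork-per-connection accept loop.  Each client gets its own
 * process, so its GL objects live and die with that process: when the
 * client disconnects, the child exits and everything it allocated gets
 * reclaimed. */
void serve(int listen_fd)
{
    signal(SIGCHLD, SIG_IGN);            /* auto-reap exited children */
    for (;;) {
        int client_fd = accept(listen_fd, 0, 0);
        if (client_fd < 0)
            continue;
        if (fork() == 0) {               /* child: handle one client */
            close(listen_fd);
            handle_client(client_fd);
            _exit(0);                    /* resources freed on exit */
        }
        close(client_fd);                /* parent keeps listening */
    }
}
```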

As “running weston over X11” implies, this seems to bring some Wayland support, but well… the compositor having access to GPU acceleration does not necessarily mean that Wayland apps produce GPU-accelerated streams to be composited, or that meaningful Wayland apps can run yet (e.g. glmark2-es2-wayland does not yet).

what next?

This looks like a good point in the prototype to take a look at rendering the window in the GPU domain. I’ll first have a look at running the server in the GPU domain. There are 2 complementary paths there:

  • the traditional GPU-in-dom0 setup, which, having no networking in dom0, will require a virtio-based transport
  • a sys-gui-gpu setup, where the current TCP transport can possibly be used as-is (this can yield results sooner, but I’ll first have to get sys-gui-gpu to work on my machine)

With the display being done in the same domain as the server, it will be possible to look at how to use a single X11 window for input and display.


Thank you for your work! Is there any news on the topic?

Thanks for your support :slight_smile:

I have taken some time to pass my own GPU through to sys-gui-gpu, so I can make use of this nice work myself :wink:
Being slowed down by build times, I was in turn led to look at how the Qubes build system could be improved – but this appears not to be a good moment to work on big changes there, so I’m refocusing on just activating various acceleration options in Mock (ccache, buildchroot caching and such) to make stubdom rebuild times more acceptable.
With a bit more work and some luck I may end up with a working sys-gui-gpu (and this, being just a personal side project, can take some time).

But once these digressions are done, I intend to look first into how to leverage the Qubes GUI protocol to avoid the double-window issue (which should make the whole thing realistic to use, even though a couple of important things will still need to be fixed before calling it a first release).


Maybe I’m missing the point here, but why don’t you use vchan/qrexec RPC instead of TCP? This would a) solve the issue of dom0 lacking a network connection, and b) implement the “Qubes way” without opening additional loopholes.

Indeed, once I have a sys-gui-gpu ready, tunneling the TCP stream through qrexec is already at the top of my TODO list (and it can already be tested today – there are examples showing how to do that, including the SplitSSH configuration). But that cannot be a long-term solution if we want to implement the GL/VK calls involving shared memory – I still have to dig into vchan to see whether it can be used for that, but there may be a need to get down to grant tables, for that particular topic at least.
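
For the simple byte-stream part, the libvchan API does look close enough to a socket; a minimal sketch of what the server side could look like (port number arbitrary, error handling omitted):

```c
#include <libvchan.h>

/* Minimal vchan byte-stream sketch.  The server runs in the GPU
 * domain, the client in the app qube; vchan gives a socket-like stream
 * with no networking involved. */
void gpu_domain_side(int client_domid)
{
    /* listen on vchan port 4400, with 1 MiB buffers each way */
    libvchan_t *ctrl = libvchan_server_init(client_domid, 4400,
                                            1 << 20, 1 << 20);
    char buf[4096];
    int n = libvchan_read(ctrl, buf, sizeof buf);  /* up to 4096 bytes */
    /* ... decode and replay the received GL commands ... */
    (void)n;
    libvchan_close(ctrl);
}
```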

Just tunneling TCP will not help get rid of the “double window” issue: while a window gets created in the GPU domain to do the real display job, we still have the X11 window created in the qube where the app runs. Fortunately that window is in fact displayed by the GUI domain, and I’m pretty much convinced the GUI protocol can be leveraged to avoid this extra window frame.

We could argue that TCP + GUI protocol would fit your “without opening additional loopholes” criterion. Whether implementing shared memory is required for adequate performance can be left for later evaluation (and it is not among the top items of my TODO list anyway).

I think there are two different requirements, and TCP won’t solve either of them.
The first is having some sort of socket connection. When available, I’d always prefer a Unix socket over a TCP connection because of system resources and latency – you don’t have to go up and down the network stack to get your message across. To me qrexec/vchan comes closest to a Unix socket, and there is a very good reason not to have any networking in dom0 (and probably sys-gui[-gpu] too). So why even bother with TCP when you can have a socket connection without networking? Requiring network access in either dom0 or sys-gui[-gpu] will be a showstopper for upstream adoption, or even for interest from serious Qubes users.

The other one is direct memory access and/or which protocol layer to tunnel for best performance/functionality. I totally get this point, but there is absolutely nothing that TCP buys you here over qrexec/vchan other than bi-directional establishment of communication.

Keep in mind that I forked a project that was (badly) using UDP. TCP is just for the PoC’s current intermediate state, and at this stage latency is not the problem to be solved :slight_smile:

TCP/UDP/Unix sockets buy us the ability to easily develop client and server in the same VM (though maybe we could use vchan to communicate with the local VM – something like a “loopback vchan”?), as well as to test outside of Qubes for comparison (though I agree that last one is not a deal-breaker).