Progress reports: streaming OpenGL/Vulkan calls to sys-gui-gpu

performance

I made a quick FPS comparison, using today’s 0b6cbf7265314 revision from git master, with the glmark2 default test (windowed, 800x600), comparing:

  • running natively on an AMD Stoney APU
  • running through gl-streaming over the loopback network interface
  • running glmark2 on a higher-end (QubesOS) laptop, streaming to the AMD Stoney APU through gigabit ethernet

Keep in mind that this is still preliminary code, with no profiling/optimisation done yet (I even removed most upstream optimisations, which were slowing down feature progress).

And all of this, obviously, is just statistics :slight_smile:

Potentially interesting information:

  • from GLS alone, most tests suffer about a 60% drop in FPS, some perform noticeably better, but one (the famous ideas scene, which does not render properly to start with) suffers a drop of nearly 90%
  • the tests that prove most demanding in the native case usually don’t suffer as much from running through GLS - that could be a rather good sign for real-life loads
  • if we add a real network to the picture, most tests show a 5-10% additional loss, except:
    • one test using client-data instead of vbo (where we likely transfer the data several times; no surprise that’s a great handicap when it happens over the network)
    • several tests show a better score than with GLS alone, which seems to hint that in those particular cases, the more powerful CPU can indeed compensate for part of the loss in GPU throughput (could those be CPU-bound benchmarks of the GPU? that would not sound good)
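
For anyone wanting to reproduce the percentages quoted in this list, the relative drop can be computed as below. This is only a sketch: the function name and the sample numbers are made up for illustration and are not measured results.

```python
def fps_drop_percent(native_fps: float, streamed_fps: float) -> float:
    """Relative FPS loss of a streamed run compared to the native run, in percent."""
    if native_fps <= 0:
        raise ValueError("native_fps must be positive")
    return 100.0 * (native_fps - streamed_fps) / native_fps


# Hypothetical numbers, for illustration only (not taken from the benchmark).
typical = fps_drop_percent(1000.0, 400.0)   # a scene losing 60% of its FPS
worst = fps_drop_percent(1000.0, 100.0)     # a scene losing 90% of its FPS
```

A negative result would mean the streamed run was faster than the native one, which is what the last bullet describes for a few scenes on the more powerful laptop.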

This is cool. Please keep us up to date on your progress!

I’m currently working on proper support for EGL and GLES extensions, with the former pretty much working and the latter still getting its details wired. In parallel, I’ve found a set of examples from a GLES2 book which put the spotlight on a nasty issue (for which I pushed a wrong fix to master, stay tuned) which may turn out to be the root cause for glmark2 -d ideas not rendering everything it should - I’m still investigating this one.

I’ll post a more formal “progress report” once those 2 items are fully dealt with.

It would be really great to have more real GLES2 applications making use of 3D, as almost everything out there on the desktop uses desktop GL and GLX. I plan to take some godot3 3D games, which mostly use GLES3, though it is quite easy to ask godot3 to use GLES2 instead - but that won’t be as good as a game designed for GLES2, we’ll just have very few graphical effects available…


progress report

It took some time since the last update, and accordingly that’s a pretty big one:

  • a process is forked on server side for each client connecting, providing both resource isolation (and freeing on client termination) and support for several simultaneous clients
  • server-side window is now created at WindowSurface-creation time, and has the expected size (server window does not get resized afterwards, but the displayed size can usually be reduced by making the input window smaller)
  • support for extensions, both for EGL and GLES2, with a couple of them implemented, as required/useful for the test programs at hand
  • successfully tested more apps (prboom, weston)
  • new support status page for tested apps
  • only EGL 1.0 is advertised, now that glmark2 landed a conformance fix allowing it to run
  • code cleanups, doc improvements, and too many bugfixes to list here
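
To illustrate the first item of this list, here is a minimal sketch of a fork-per-client server. It is not the gl-streaming code (which is written in C); the echo handler and all names are placeholders, and it assumes a POSIX system where `os.fork` is available.

```python
import os
import socket
from typing import Optional


def handle_client(conn: socket.socket) -> None:
    """Placeholder per-client work: echo bytes back until the client is done."""
    with conn:
        while data := conn.recv(4096):
            conn.sendall(data)


def reap_children() -> None:
    """Collect exited children without blocking, so zombies do not pile up."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none has exited yet


def serve(listener: socket.socket, max_clients: Optional[int] = None) -> None:
    """Accept connections and fork one child process per client."""
    served = 0
    while max_clients is None or served < max_clients:
        conn, _addr = listener.accept()
        if os.fork() == 0:
            # Child: serves this client only. Whatever it allocated is
            # released by the OS when it exits, even if the client crashes.
            listener.close()
            try:
                handle_client(conn)
            finally:
                os._exit(0)
        # Parent: drop its copy of the client socket and keep accepting.
        conn.close()
        reap_children()
        served += 1
```

Since each client lives in its own process, a misbehaving client can only take down its own server-side state, and several clients can be served at the same time.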

As “running weston over X11” implies, that seems to bring some Wayland support, but well… the compositor having access to GPU acceleration does not necessarily mean the Wayland apps produce GPU-accelerated streams to be composed, or that meaningful Wayland apps can run yet (e.g. glmark2-es2-wayland does not yet).

what next?

It looks like a good point in this prototype to take a look at rendering the window in the GPU domain. I’ll first have a look at running the server on the GPU domain. There are 2 complementary paths there:

  • traditional GPU-in-dom0 setup, but without networking in dom0, will require a virtio-based transport
  • sys-gui-gpu setup, where possibly the current TCP transport can be used as-is (can yield sooner results, but I’ll have to get that sys-gui-gpu to work first on my machine)

With the display being done in the same domain as the server, it will be possible to look at how to use a single X11 window for input and display.


Thank you for your work! Is there any news on the topic?

Thanks for your support :slight_smile:

I have taken some time to get my own GPU passed through to sys-gui-gpu, to be able to make use of this nice work myself :wink:
This being slowed down by build times has in turn brought me to see how we could improve the Qubes build system - but it appears not to be a good moment to work on big changes there, so I’m refocusing on just activating various acceleration options in Mock (ccache, buildchroot caching and such) to make stubdom rebuild times more acceptable.
With a bit more work and some luck I may end up having this sys-gui-gpu working (and this, being just a personal side project, can take some time).

But once these digressions get done, I intend to look first into how to leverage the Qubes GUI protocol to avoid the double-window issue (which should make the whole thing realistic to use, even though a couple of important things will still need to be fixed before calling it a first release).


Maybe I’m missing the point here, but why don’t you use a vchan/qrexec-rpc instead of tcp? This would a) solve the issue with lacking networking connection to dom0 and b) implement the “qubes-way” without opening additional loopholes.

Indeed, once I have a sys-gui-gpu ready, tunneling the TCP stream through qrexec is at the top of my TODO list (and can already be tested today, there are examples showing how to do that, including the SplitSSH configuration). But it cannot be a long-term solution if we want to implement the GL/VK calls involving shared memory - I still have to get into vchan to see if they can be used, but there may be a need to get down to grant tables, for that particular topic at least.

Just tunneling TCP will not help to get rid of the “double window” issue: while a window gets created in the GPU domain to do the real display job, we still have the X11 window created in the qube where the app runs. Fortunately this window is in fact displayed by the GUI domain, and I’m pretty much convinced the GUI protocol can be leveraged to avoid this extra window frame.

We could argue that TCP + GUI protocol would fit your “without opening additional loopholes” criterion. Whether shared memory is required for correct performance can be left for later evaluation (and is not in the top items of my TODO list anyway).

I think there are two different requirements and TCP won’t solve either of them.
The first is having some sort of socket connection. When available, I’d always prefer a unix socket over a TCP connection because of system resources and latency - you don’t have to go up and down the stack to get your message across. To me qrexec/vchan comes closest to a unix socket and there is a very good reason to not have any networking in dom0 (and probably sys-gui[-gpu] too). So why even bother with TCP when you can also use a socket connection without network? Requiring network in either dom0 or sys-gui[-gpu] will be a showstopper for even considering upstream adoption and/or interest by serious Qubes users.

The other one is direct memory access and/or which protocol layer to tunnel for best performance/functionality. I totally get this point, but there is absolutely nothing that TCP buys you here over qrexec/vchan other than bi-directional establishment of communication.

Keep in mind I forked a project that was (badly) using UDP. TCP is just for the current intermediate state of PoC, and at this stage latency is not the problem to be solved :slight_smile:

TCP/UDP/Unix-sockets buy us the ability to easily develop client and server in the same VM (though maybe we could use vchan to communicate with the local VM, something like a “loopback vchan”?), as well as testing outside of Qubes to get some comparisons (though I agree that last one is not a deal breaker).

As a first step towards Qubes GUI integration, and as a followup to previous posts, I updated usage instructions with the details for setting up a GLS server in the GPU domain.

My plan for the next step is to dig into qubes-guid so the GLS server can query it for the GUI-domain window backing the display, instead of creating a new one of its own (which currently appears as belonging to the GPU domain).

Once that is done and a proper design is validated, we’ll be able to address the known issues and aim for a first release. And then go Vulkan :wink: .


This was easier than expected: since qubes-guid sets window properties in the GUI domain pointing to the source VM and window ID, I was able to just spy on them and create the GLES context on this window.
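
The lookup described here boils down to scanning the GUI-domain windows for the one whose properties name the right source VM and source window. A rough sketch follows, working on an already-collected snapshot of properties rather than on a live X11 connection. The property names `_QUBES_VMNAME` and `_QUBES_VMWINDOWID` are an assumption of this sketch and should be checked against what qubes-guid really sets (e.g. with xprop); the function name and sample values are made up.

```python
from typing import Dict, Optional

# Assumed property names, not confirmed by this thread: verify before use.
PROP_VMNAME = "_QUBES_VMNAME"
PROP_REMOTE_ID = "_QUBES_VMWINDOWID"


def find_gui_window(windows: Dict[int, Dict[str, object]],
                    vm_name: str,
                    remote_window_id: int) -> Optional[int]:
    """Return the GUI-domain window id whose properties point at the given
    source VM and source window, or None when no window matches."""
    for gui_id, props in windows.items():
        if (props.get(PROP_VMNAME) == vm_name
                and props.get(PROP_REMOTE_ID) == remote_window_id):
            return gui_id
    return None


# Hypothetical snapshot: GUI-domain window id -> properties read from it.
snapshot = {
    0x1200005: {PROP_VMNAME: "work", PROP_REMOTE_ID: 0x600003},
    0x1200009: {PROP_VMNAME: "games", PROP_REMOTE_ID: 0x600003},
}
target = find_gui_window(snapshot, "games", 0x600003)
```

Both the VM name and the window ID have to match, since window IDs are only unique within one qube.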

There is a quick-and-dirty PoC in the wip/qubes-window branch, but things will have to be put straight before it is ready to merge (e.g. the TCP mode can’t be used any more in that branch).

I noticed that socat is in fact really slow: just using it on localhost to forward TCP traffic between a local client and server is enough to drop the es2gears framerate from ~500fps to ~60fps :scream:. That’s probably enough to explain the terrible performance I get with this branch (there are 2 socat processes in the pipe, so if each causes a 9x slowdown, the numbers make sense…).
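
As a back-of-the-envelope check of those figures, the snippet below works out the slowdown of a single socat hop and what two hops could cost. Both ways of combining the hops are simplifying assumptions of this sketch, not measurements from the branch.

```python
def frame_time_ms(fps: float) -> float:
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def socat_estimates(direct_fps: float, one_hop_fps: float) -> dict:
    """Slowdown of one forwarding hop, plus two rough guesses for two hops."""
    slowdown = direct_fps / one_hop_fps
    added_ms = frame_time_ms(one_hop_fps) - frame_time_ms(direct_fps)
    return {
        "slowdown_per_hop": slowdown,
        "added_ms_per_hop": added_ms,
        # Model A (assumption): every hop divides the framerate again.
        "two_hops_multiplicative_fps": direct_fps / slowdown ** 2,
        # Model B (assumption): every hop adds a fixed latency per frame.
        "two_hops_additive_fps": 1000.0 / (frame_time_ms(direct_fps) + 2 * added_ms),
    }


# Approximate figures quoted above: ~500fps direct, ~60fps through one socat.
estimates = socat_estimates(500.0, 60.0)
```

With the quoted figures, one hop is a slowdown of about 8.3x, or roughly 14.7 ms added per frame; two hops land somewhere between about 7 fps (model A) and about 32 fps (model B).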

Next steps:

  • use the qrexec pipe directly to get rid of socat
  • cleanup that messy PoC :slight_smile:

Here is the weekly update^Wteaser :slight_smile:

I now have a PoC for a qrexec transport (in the wip/transports branch). Performance-wise, all benches are capped to my 144Hz framerate, and glmark2 only gets a global mark of 143 because of its rounding (only the 2 most complex scenes went down to 143, and I believe they would each have got a higher mark without the capping). So great news overall, and a strong hint to get rid of that framerate cap :wink:

That branch is a WiP one too, and not built atop wip/qubes-window, so you can’t get all the good stuff in one test yet. This branch is much closer to being mergeable than the latter, though: it already provides compile-time selection of the transport to be used, with choices of TCP (the implementation currently in master), stdio (useful for testing, and for the GUI-domain server behind qrexec), and qrexec-client-vm (communicating by pipes with said tool to abstract qrexec communication details). Some dynamic selection and small cleanups will be needed, but it should land in master real soon now. Only after that will it be time for a proper implementation of wip/qubes-window.
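
To give an idea of what a pipe-based transport with selectable backends can look like, here is a small sketch. It is not the gl-streaming implementation: the class, the function names and the length-prefixed framing are all assumptions made for the example.

```python
import struct
import subprocess
import sys
from typing import BinaryIO, List, Optional


class PipeTransport:
    """Length-prefixed messages over a pair of byte streams."""

    def __init__(self, reader: BinaryIO, writer: BinaryIO,
                 proc: Optional[subprocess.Popen] = None) -> None:
        self.reader, self.writer, self.proc = reader, writer, proc

    def send(self, payload: bytes) -> None:
        # 4-byte big-endian length, then the payload itself.
        self.writer.write(struct.pack("!I", len(payload)) + payload)
        self.writer.flush()

    def recv(self) -> bytes:
        (length,) = struct.unpack("!I", self._read_exact(4))
        return self._read_exact(length)

    def _read_exact(self, n: int) -> bytes:
        chunks = []
        while n > 0:
            chunk = self.reader.read(n)
            if not chunk:
                raise EOFError("peer closed the stream")
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def close(self) -> None:
        self.writer.close()
        if self.proc is not None:
            self.proc.wait()
        self.reader.close()


def make_transport(kind: str, argv: Optional[List[str]] = None) -> PipeTransport:
    """Pick a transport by name at run time (names are hypothetical)."""
    if kind == "stdio":
        # Our own stdin/stdout, e.g. when started as a service handler.
        return PipeTransport(sys.stdin.buffer, sys.stdout.buffer)
    if kind == "command":
        # Talk to a helper process through its stdin/stdout pipes.
        proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        return PipeTransport(proc.stdout, proc.stdin, proc)
    raise ValueError(f"unknown transport: {kind}")
```

On the client side, the helper command would be something along the lines of `["qrexec-client-vm", "sys-gui-gpu", "<service name>"]`, where the service name depends on how the server is registered in the GUI domain. For local testing, `make_transport("command", ["cat"])` gives a simple echo peer.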

Edit: advertising EGL 1.1 instead of 1.0 (despite eglSwapInterval being the only part of 1.1 implemented) is sufficient to unleash the FPS in glmark2. Current framerates in this version are about 10% of the native speed (yes, -90%, quite some progress margin, though already better than software rendering on the same machine, which performs at 1/25th of the native GPU speed).


Latest news from gl-streaming:

  • support for stdio and qrexec transports was merged to master, selectable as a runtime option
  • the glDrawElements operation got an important fix, notably allowing glmark2’s ideas bench to render properly
  • the wip/qubes-window branch was rebased on top of this, which allows testing the latest and greatest in terms of QubesOS integration
  • benchmarks were refreshed on AMD Renoir and Stoney (tl;dr: slowdown by 10 on Renoir, and only by 3 on the much less capable Stoney)

Next focus points include:

  • continuing cleaning up wip/qubes-window
  • digging into the performance hit
  • taking a look at how to implement shared-memory APIs (qubes-guid seems to provide a useful example)
  • taking a look at what was done in Mesa for Vulkan serialization, to see if we can use it and avoid reinventing the wheel

Quick update, before really diving into shared-memory transport: there was a huge performance hit coming from an old and easily fixed sleep-in-a-loop I had missed, and glmark2 in a qube through gl-streaming now runs barely 3 times slower than natively in dom0.
Details of the benchmark can be found here.
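
For those unfamiliar with why a sleep-in-a-loop hurts so much, here is a generic illustration (not the gl-streaming code, all names are made up): a waiter that polls and sleeps only notices a message at its next wake-up, while a waiter that blocks is woken as soon as the message is posted.

```python
import queue
import threading
import time


def wait_by_polling(q: queue.Queue, interval: float = 0.01) -> bytes:
    """Anti-pattern: check, sleep, repeat. A message posted during the sleep
    waits until the sleep is over before being picked up."""
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            time.sleep(interval)


def wait_by_blocking(q: queue.Queue) -> bytes:
    """Block until a message is posted; no wake-up delay is added."""
    return q.get()


def average_latency(waiter, rounds: int = 20) -> float:
    """Average time in seconds between posting a message and the waiter
    returning it."""
    q: queue.Queue = queue.Queue()
    total = 0.0
    for _ in range(rounds):
        sent_at = []

        def produce() -> None:
            time.sleep(0.003)  # let the waiter start waiting first
            sent_at.append(time.perf_counter())
            q.put(b"frame")

        producer = threading.Thread(target=produce)
        producer.start()
        waiter(q)
        total += time.perf_counter() - sent_at[0]
        producer.join()
    return total / rounds
```

Comparing `average_latency(wait_by_polling)` with `average_latency(wait_by_blocking)` should show the polling version paying a good part of its sleep interval on every message, which adds up quickly when every frame involves several round trips.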


Moved to ‘User Support | Guides’

Well, I’m not sure this thread qualifies as a proper guide - and I would not encourage the deployment of gl-streaming in its current state. It is really still experimental, and I post those updates to let people know what’s coming, hoping others will join to make this a reality faster :wink:

@yann I agree and see now that I made a mistake moving this out of General Discussion. Will reverse that, sorry.

It’s been some time since the last update - large impact on the code for shared-memory support, and too little time to work on this :wink:
On the way to using shared memory between a qube and its sys-gui-gpu, I just reached a nice milestone, even if it does not translate into any immediate improvement for the Real Thing: a PoC for local inter-process shared memory has promising benchmarks. The overall glmark2 result shows the cost of the gl-streaming processing dropping by half when compared to the previously-best setup (stdio transport), with a score only 20% below the native one. A couple of complex benches still show a significantly larger overhead (more than 70%), which likely shows that real-life use will still need more work.
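
The general idea of such a local shared-memory path can be sketched as follows: bulk data is written once into a shared segment, and only a small (offset, length) descriptor goes through the command stream. This is an illustration using Python's `multiprocessing.shared_memory` (Python 3.8+), not the PoC itself; the function names and the sample vertex data are made up, and sharing memory between VMs would need a different mechanism.

```python
import struct
from multiprocessing import shared_memory
from typing import Tuple


def publish(shm: shared_memory.SharedMemory, payload: bytes,
            offset: int = 0) -> Tuple[int, int]:
    """Copy payload into the shared segment and return the small
    (offset, length) descriptor to send instead of the data itself."""
    end = offset + len(payload)
    if end > shm.size:
        raise ValueError("payload does not fit in the shared segment")
    shm.buf[offset:end] = payload
    return offset, len(payload)


def fetch(shm_name: str, offset: int, length: int) -> bytes:
    """Peer side: attach to the segment by name and read the described range."""
    peer = shared_memory.SharedMemory(name=shm_name)
    try:
        return bytes(peer.buf[offset:offset + length])
    finally:
        peer.close()


# Hypothetical vertex data: three 2D points as little-endian floats.
vertices = struct.pack("<6f", 0.0, 0.5, -0.5, -0.5, 0.5, -0.5)
segment = shared_memory.SharedMemory(create=True, size=4096)
try:
    descriptor = publish(segment, vertices)
    received = fetch(segment.name, *descriptor)
finally:
    segment.close()
    segment.unlink()
```

In a real setup `fetch` would run in the server process, with the segment name and descriptors passed over the existing transport.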

This branch is really still a mess for now, and nowhere near mainlining; I intend to see first how much those changes live up to sharing memory between VMs - and then get back to finishing all those PoC branches into something solid enough for larger publication.


I’ve been using Qubes for many years now (I started with 3.2, or was it 3.1?) and I have slowly been coming to the conclusion that what you are working on is something that is going to be absolutely critical in the not too distant future.

More and more software is starting to require a GPU for acceptable performance. Already a lot of web sites simply don’t work on Qubes because they use WebGL. Even applications such as LibreOffice use the GPU.

My main laptop is a T490 running Qubes, but I currently have an extra workstation that does not run Qubes, and this is only because I need the GPU on it (not necessarily for games, but just to be able to use a web browser in 4k at a reasonable speed). I think for a lot of people, the alternative to using the GPU in Qubes is not “don’t use the GPU”, but rather “don’t use Qubes”, and that’s quantifiably worse than the lack of security caused by sharing the GPU.

I’m posting this because even though I personally am happy with my Qubes laptop, I also want to see more people using Qubes, but the value proposition is a very hard sell when they are being asked to let go of GPU support. I’m hoping that this project may be the solution to this at some point.
