If a command takes too long to execute, Vulkan drivers will think they
are inflooping and abort what they were doing.
For the simple color shader with smallish nodes, this happens around
10M instances, as tested with the output of
./tests/rendernode-create-tests 10000000 colors.node
So just limit it to way lower, so that we barely never hit it, ut still
pick a big number so this optimization stays noticable.
For small regions, the optimization doesn't matter that much, so we
don't need to do lots of work on the CPU.
In particular, this should catch icons and their backgrounds (32x32),
but I was generous in selecting the number.
Gets my discrete AMD on widget-factory back to the 1900fps it had before
this optimization while making the driver clock the GPU's shader at
1.7GHz instead of the 2.1GHz it used before.
Using clear avoids the shader engine (see last commit), so if we can get
pixels out of it, we should.
So we detect the overlap with the rounded corners of the clip region and
emit shaders for those, but then use Clear() for the rest.
With this in place, widget-factory on my integrated Intel TigerLake gets
a 60% performance boost.
The op emits a vkCmdClearAttachments() with a given color. That can be
used with color nodes that are pixel-aligned and opaque to significantly
speed up rendering when the window background is a solid color.
However, currently this fails a bit outside of fullscreen when rounded
clip rectangles are in use to draw rounded corners.
Instead of using the upload vfunc and going via the code in
GskVulkanImage, copy/paste the relevant code into the command() vfunc.
This is meant to achieve multiple things:
1. Get rid of GskVulkanUploader and its own command buffer and general
non-integration with operations.
2. Get rid of GskVulkanOp:upload()
3. Get the upload/download code machinery for GskVulkanImage and put it
with the actual operations.
The current code can't do direct upload/download, that will follow in a
future commit.
... instead of doing the equivalent things manually by creating a
RenderPass and calling the relevant functions.
Now all renderpass operations are indeed stored in ops.
Also reshuffle the command emission code, because we no longer need to
emit the ops for the base renderpass.
As a result we only submit a single command buffer containing all the
render passes instead of once per render pass.
We also bind vertex buffers and descriptor sets only once now at the
start instead of once per renderpass.
Use the OpClass.stage to order operations:
1. Put upload ops first
This way we can ensure they are executed first.
2. Move subpasses for offscreens in front of the pass using them.
This is a massive refactoring because it collects all the renderops
of all renderpasses into one long array in the Render object.
Lots of code in there is still flaky and needs cleanup. That will
follow in further commits.
Other than that it does work fine though.
All the ops that just execute a shader do pretty much the same stuff, so
put it all in a single function that they all call.
It's basically faking a base class for them.
Instead of recreating the same renderpass object in every frame and for
every offscreen, just reuse it.
Technically, we can save this per-renderer or even per-display (it
should really be cached by VkDevice), but we have no infrastructure for
that.
The function name gsk_vulkan_render_get_pipeline() had been used for
GskVulkanPipeline. Since those are gone now, we can use that name for
VkPipelines.
Renderpasses get recreated every frame, but we keep render objects
around. So if we keep the vertex buffer in the render object, we can
also keep it around and just reuse it.
Also, we only need one buffer for all the render passes, which is
another bonus.
The initial buffer size is chosen at 128kB. Maximized Nautilus,
gnome-text-editor with an open file and widget-factory take ~100kB when
doing a full redraw. Other apps are between 30-50kB usually.
So I chose a value that is not too big, but catches ~90% of cases.
Interning strings is slow, especially if we can instead do direct
pointer compares.
Also refactor the pipeline lookup code a bit to make use of the
refactored code.
Set it after creating all the ops and then use it for iterating.
Note that we cannot set it while creating the ops because the array may
be realloc()ed into a different memory region which would invalidate all
the pointers.
It currently has no use, but that will come later.
Also put the typedefs into headers in gsk/vulkan, they have nthing to do
outside that directory.
Remove the function to add a node from both the GskVulkanRender and the
GskVulkanRenderPass.
That means they are both now meant to draw exactly one node.
This is a rudimentary - but working - port.
Glyph uploads are still using the old machinery, a bunch of functions
still exist that probably aren't necessary anymore and each glyph emits
its own node.
This will need to be improved in further commits.
This shader is an updated version of the mask shader, but I want to use
the mask name for the mask node and that's a different functionality.
Also, add an operation for it and partially implement the mask node
using it, so we can test that this shader works.
Replacing the shader used for text rendering is the next step.
The benefit here is that we can now properly cross-fade when one of
start/end is fully clipped out by just replacing it with an opacity op
for the other.
This was not possible with the old way we did things.
Instead of creating a pipeline GObject, just ask for the VkPipeline.
And instead of having the Op handle it, just let the renderpass look
up/create the relevant pipeline while creating commands so that it can
insert vkCmdBindPipeline calls as-needed.
This reverts commit 0f184d3270.
The renderer is good enough to make use of the clip region.
Or rather: If it isn't, the renderpass should take care of that, not the
render object.
This reverts most of commit f420c143e0
again because it turns out GPUs like combined images and samplers.
But: The one thing we don't revert is allowing the C code to select any
combination of sampler and image:
gsk_vulkan_render_get_image_descriptor() now takes a 2nd argument
specifying the sampler.
This allows the same flexibility as before, we just combine things
early.
This change was inspired by
https://developer.nvidia.com/blog/vulkan-dos-donts/
Have a resource path => vkShaderModule hash table instead of doing fancy
custom objects.
A benefit is that shader modules are now shared between all renderers
and pipelines.
Instead of creating the op manually, just pass in the renderpass and
have the op created from there.
This way ops aren't really initialized anymore, they are more appended
to the queue, so instead of foo_op_init() we can just call the function
foo_op().
The new code always uses an offscreen, even for children that are
exactly fitting texture nodes.
I would have had to write more code and didn't consider it worth it,
especially because it would have required complicating the
get_as_image() function.
This was the last node using the texture pipeline.
Instead of having one function that gets the image for the texture and
uploads it if it doesn't exist yet, make it 2 functions:
One to get the texture if it exists.
One to assign an uploaded image to the texture.
This way, we can potentially do the upload ourselves.
Allocate the memory up front instead of passing the Op into it.
This way, we can split ops into their own source file and use
init/finish style to use them.
GskVulkanOp is meant to be a proper abstraction of operations
the Vulkan renderer will be doing.
For now it's an atrocious clunky piece of junk wedged into the
renderpass codebase.
It's so temporary that I didn't even adjust indentation of the code.
Intersection with a roudned clip takes too long.
Instead, rename the function to may_intersect() to be clear about what
it does and then just intersect with the regular rectangle.
If we don't clip anything, we stil have bounds - either the framebuffer
size or (more likely) the scissor rect. And we don't want to draw
anything that is outside these bounds.
So clip in those cases, too.
Stops gtk4-demo --run=listbox from trying to render the whole listbox
instead of only the visible parts.
Use G_TYPE_CHECK_INSTANCE_TYPE() instead of just checking for != NULL.
After all, this is a GTypeInstance.
Also fixes some gcc complaints when checking
node == NULL || GSK_IS_RENDERNODE (node)
which gcc was convinced would be always true.
The match operator was added in Python 3.10, which is a bit too new for
some downstreams.
While at it, let's fix the flake8 errors and warnings.
Fixes: #5934
In particular, fix the combination of luminance and alpha. We want to do
mask = luminance * alpha
and for inverted
mask = (1.0 - luminance) * alpha
so add a test that makes sure we do that and then fix the code and
existing tests to conform to it.
When color-matrix modifying a clear surface, the surface would remain
clear according to Cairo.
That's very unfortunate when we prepare a mask for inverted-alpha
masking.
If we build our own targets, we need to include those.
This is only relevant when adding new shaders because meson will
complain that the (unused) sources don't exist as it tries to include
those.
And that will make the build.ninja file not be generated which would
have build those shaders and would have allowed to copy them into the
sources.
Note that this makes builds with glslc not care about all the shader
files being included with the sources, but we have CI to check that.
In particular, catch radius values being < 0 by return_if_fail()ing in
the rendernode creation code, and by erroring out in the rendernode
parser.
I try too much dumb stuff in the node editor.
The IFUNC resolvers that we are using here get
run early, before asan had a chance to set up its
plumbing, and therefore things go badly if they
are compiled with asan. Turning it off makes things
work again.
The gcc bug tracking this problem:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110442
Thanks to Jakub Jelinek and Florian Weimer for
analyzing this and recommending the workaround.
g_hash_table_insert() frees the given key if it already exists
in the hashtable. But since we use the same pointer in the
following line, it will result in use-after-free.
So instead, insert the key only if it doesn't exist.
Make the display handle the cache, because we only need one.
We store the cache in
$CACHE_DIR/gtk-4.0/vulkan-pipeline-cache/$UUID.$VERSION
so we regenerate caches for each different device (different UUID) and
each different driver version.
We also keep track of the etag of the cache file, so if 2 different
applications update the cache, we can detect that.
Vulkan allows merging caches, so the 2nd app reloads the new cache file
and merges it into its cache before saving.
It turns out variable length is only supported for the last binding in
a set, not for every binding.
So we need to create one set for each of our arrays.
[ VUID-VkDescriptorSetLayoutBindingFlagsCreateInfo-pBindingFlags-03004 ] Object 0: handle = 0x33a9f10, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xd3f353a | vkCreateDescriptorSetLayout(): pBindings[0] has VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT but 0 is the largest value of all the bindings. The Vulkan spec states: If an element of pBindingFlags includes VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT, then all other elements of VkDescriptorSetLayoutCreateInfo::pBindings must have a smaller value of binding (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkDescriptorSetLayoutBindingFlagsCreateInfo-pBindingFlags-03004)
Somebody (me) had flipped the 2 flags in commit ba28971a18:
[ VUID-vkCmdCopyBufferToImage-srcBuffer-00174 ] Object 0: handle = 0x3cfaac0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x430000000043, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xe1b276a1 | Invalid usage flag for VkBuffer 0x430000000043[] used by vkCmdCopyBufferToImage. In this case, VkBuffer should have VK_BUFFER_USAGE_TRANSFER_SRC_BIT set during creation. The Vulkan spec states: srcBuffer must have been created with VK_BUFFER_USAGE_TRANSFER_SRC_BIT usage flag (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkCmdCopyBufferToImage-srcBuffer-00174)
If a node has a higher depth, pick the RGBA format that has that depth
as the texture format we're renderig to with render_texture().
Support for adapting the swapchain is not part of this.
When a GdkMemoryFormat isn't supported, pick close formats that have a
higher chance of being supported.
Make sure this works recursively and the whole loop always ends up at
R8G8B8A8_UNORM because that one is mandatory.
Roughly, follow these rules:
1. Drop the unpremultiplied
2. Expand channels to include all of RGBA
3. pick swizzle that is RGBA
4. pick next largest depth
5. pick R8G8B8A8_UNORM
This way, we unify the code paths for memory access to textures.
We also technically gain the ability to modify images, though I have no
use case for this.
That way, the offscreen can create images of different types.
Its not used in this commit, but will come in handy when we want to
support high bit depth.
Pretty much copy what GL does and just use the default display to create
GPU-related resources without the need for a display.
This also adds gdk_display_create_vulkan_context() but I've
kept it private because the Vulkan API is generally considered in flux,
in particular with our pending attempts to redo how renderers work.
Fixes a bug introduced in d1135f9e3c.
Luckily the buffer was large enough that all my testing didn't catch it
because it took a few minutes to overflow.
Replace gdk_memory_format_prefers_high_depth with the more generic
gdk_memory_format_get_depth() that returns the depth of the individual
channels.
Also make the GL renderer use that to pick the generic F16 format
instead of immediately going for F32 when uploading textures.
Now that we don't use the old environment variables anymore to force
staging buffer/image uploads, we don't need them.
However, we do autodetect the fast path for avoiding a staging buffer
now, and we might want to be able to turn that off for testing.
So add GSK_DEBUG=staging that does exactly that.
This is unused now that all the code uses map/unmap.
The only thing that map/unmap doesn't do that the old code did, was use
a staging image instead as alternative to a staging buffer for image
uploads.
However, that code is not necessary for anything, so I'm sure we can do
without.
If the memory heap that the GPU uses allows CPU access
(which is the case on basically every integrated GPU, including phones),
we can avoid a staging buffer and write directly into the image memory.
Check for this case and do that automatically.
Unfortunately we need to change the image format we use from
VK_IMAGE_TILING_OPTIMAL to VK_IMAGE_TILING_LINEAR, I haven't found a way
around that yet.
Use the new map/unmap image upload method for Cairo node drawing:
1. map() the memory
2. create an image surface or that memory
3. draw to that image surface
4. success
There's no longer a need for Cairo to allocate image memory.
As an alternative to gsk_vulkan_image_new_from_data() that
takes a given data and creates an image from it, add a 3 step process:
gsk_vulkan_image_new_for_upload()
gsk_vulkan_image_map_memory()
/* put data into memory */
gsk_vulkan_image_unmap_memory()
The benefit of this approach is that it potentially avoids a copy;
instead of creating a buffer to pass and writing the data into it before
then memcpy()ing it into the image, the data can be written straight
into image memory.
So far, only the staging buffer upload is implemented.
There are also no users, those come in the next commit(s).
When nodes are added, nothing was warning us that we need to bump
N_RENDER_NODES.
Make sure that that's no longer necessary by refactoring the code to
remove the define.
This is more expensive, but it finds more cases, and in particular it
catches corner cases like empty nodes or fully clipped nodes that might
otherwise make the kernel throw signals in our direction.
If one of the descriptor sets doesn't have any items, don't include it
in the sets passed to vkUpdateDescriptorSets().
This has no effect right now, because we either have both images and
samplers or neither, but it will become relevant once we also support
buffers.
Respect the matrix in use at time of encountering a repeat node so that
the offscreen uses roughly the same device pixel density as the target.
Fixes the handling of the clipped-repeat test.
Sometimes the GPU is still busy when the next frame starts (like when
no-vsync benchmarking), so we need to keep all those resources alone and
create new ones.
That's what the render object is for, so we just create another one.
However, when we create too many, we'll starve the CPU. So we'll limit
it. Currently, that limit is at 4, but I've never reached it (I've also
not starved the GPU yet), so that number may want to be set lower/higher
in the future.
Note that this is different from the number of outstanding buffers, as
those are not busy on the GPU but on the compositor, and as such a
buffer may have not finished rendering but have been returend from the
compositor (very busy GPU) or have finished rendering but not been
returned from the compositor (very idle GPU).
The idea here is that we can do more complex combinations and use that
to support texture-scale nodes or use fancy texture formats (suc as
YUV).
I'm not sure this is actually necessary, but for now it gives more
flexibility.
For blend and crossfade nodes, one of the children may exist and
influence the rendering, while the other does not.
Previously, we would skip the node, which would cause the required
rendering to not happen. We now send a valid texture id for the
invalid offscreen, thereby actually rendering the required parts.
Fixes the blend-invisible-child compare test
Current state for compare tests:
Ok: 397
Expected Fail: 0
Fail: 26
Unexpected Pass: 0
Skipped: 2
Timeout: 0