Prior to this point, we were treating them as flattened, as they are in
old-style tessellation control shaders, and still are for structs in
new-style shaders. This is not true for outputs; output composites are
not flattened at all. This semantic mismatch broke a Vulkan CTS test.
It should now pass.
In Metal render pipelines don't have an option to set a sampleMask
parameter, the only way to get that functionality is to set the
sample_mask output of the fragment shader to this value directly.
We also need to take care to combine the fixed sample mask with the
one that the shader might possibly output.
This should hopefully reduce underutilization of the GPU, especially on
GPUs where the thread execution width is greater than the number of
control points.
This also simplifies initialization by reading the buffer directly
instead of using Metal's vertex-attribute-in-compute support. It turns
out the only way in which shader stages are allowed to differ in their
interfaces is in the number of components per vector; the base type must
be the same. Since we are using the raw buffer instead of attributes, we
can now also emit arrays and matrices directly into the buffer, instead
of flattening them and then unpacking them. Structs are still flattened,
however; this is due to the need to handle vectors with fewer components
than were output, and I think handling this while also directly emitting
structs could get ugly.
Another advantage of this scheme is that the extra invocations needed to
read the attributes when there were more input than output points are
now no more. The number of threads per workgroup is now lcm(SIMD-size,
output control points). This should ensure we always process a whole
number of patches per workgroup.
To avoid complexity handling indices in the tessellation control shader,
I've also changed the way vertex shaders for tessellation are handled.
They are now compute kernels using Metal's support for vertex-style
stage input. This lets us always emit vertices into the buffer in order
of vertex shader execution. Now we no longer have to deal with indexing
in the tessellation control shader. This also fixes a long-standing
issue where if an index were greater than the number of vertices to
draw, the vertex shader would wind up writing outside the buffer, and
the vertex would be lost.
This is a breaking change, and I know SPIRV-Cross has other clients, so
I've hidden this behind an option for now. In the future, I want to
remove this option and make it the default.
Metal is picky about interface matching. If the types don't match
exactly, down to the number of vector components, Metal fails pipline
compilation. To support pipelines where the number of components
consumed by the fragment shader is less than that produced by the vertex
shader, we have to fix up the fragment shader to accept all the
components produced.
DXVK emits SPIR-V where fragment shader builtins have names derived from
DXBC assembly, e.g. `oDepth` for `FragDepth`. When we declared the
disabled output, we used this name, but when referencing it, we
continued to use the GLSL name. This breaks compilation.
Like with `point_size` when not rendering points, Metal complains when
writing to a variable using the `[[depth]]` qualifier when no depth
buffer be attached. In that case, we must avoid emitting `FragDepth`,
just like with `PointSize`.
I assume it will also complain if there be no stencil attachment and the
shader write to `[[stencil]]`, or it write to `[[color(n)]]` but there
be no color attachment at n.
Here, the inline uniform block is explicit: we instantiate the buffer
block itself in the argument buffer, instead of a pointer to the buffer.
I just hope this will work with the `MTLArgumentDescriptor` API...
Note that Metal recursively assigns individual members of embedded
structs IDs. This means for automatic assignment that we have to
calculate the binding stride for a given buffer block. For MoltenVK,
we'll simply increment the ID by the size of the inline uniform block.
Then the later IDs will never conflict with the inline uniform block. We
can get away with this because Metal doesn't require that IDs be
contiguous, only monotonically increasing.
MSL does not support this, so we have to emulate it by passing it around
as a varying between stages. We use a special "user(clipN)" attribute
for this rather than locN which is used for user varyings.
To support loading array of array properly in tessellation, we need a
rewrite of how tessellation access chains are handled.
The major change is to remove the implicit unflatten step inside
access_chain which does not take into account the case where you load
directly from a control point array variable.
We defer unflatten step until OpLoad time instead.
This fixes cases where we load array of {array,matrix,struct}.
Removes the hacky path for MSL access chain index workaround.
If there are enough members in an IAB, we cannot use the constant
address space as MSL compiler complains about there being too many
members. Support emitting the device address space instead.
Rolled the hashes used for glslang, SPIRV-Tools, and SPIRV-Headers to
HEAD, which includes the update to 1.5.
Added passing '--amb' to glslang, so I didn't have to explicitly set
bindings in a large number of test shaders that currently don't, and
now glslang considers them invalid.
Marked all shaders that no longer pass spirv-val as .invalid.
Vulkan has two types of buffer descriptors,
`VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC` and
`VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC`, which allow the client to
offset the buffers by an amount given when the descriptor set is bound
to a pipeline. Metal provides no direct support for this when the buffer
in question is in an argument buffer, so once again we're on our own.
These offsets cannot be stored or associated in any way with the
argument buffer itself, because they are set at bind time. Different
pipelines may have different offsets set. Therefore, we must use a
separate buffer, not in any argument buffer, to hold these offsets. Then
the shader must manually offset the buffer pointer.
This change fully supports arrays, including arrays of arrays, even
though Vulkan forbids them. It does not, however, support runtime
arrays. Perhaps later.
Writable textures cannot use argument buffers on iOS. They must be
passed as arguments directly to the shader function. Since we won't know
if a given storage image will have the `NonWritable` decoration at the
time we encode the argument buffer, we must therefore pass all storage
images as discrete arguments. Previously, we were throwing an error if
we encountered an argument buffer with a writable texture in it on iOS.
This was straightforward to implement in GLSL. The
`ShadingRateInterlockOrderedEXT` and `ShadingRateInterlockUnorderedEXT`
modes aren't implemented yet, because we don't support
`SPV_NV_shading_rate` or `SPV_EXT_fragment_invocation_density` yet.
HLSL and MSL were more interesting. They don't support this directly,
but they do support marking resources as "rasterizer ordered," which
does roughly the same thing. So this implementation scans all accesses
inside the critical section and marks all storage resources found
therein as rasterizer ordered. They also don't support the fine-grained
controls on pixel- vs. sample-level interlock and disabling ordering
guarantees that GLSL and SPIR-V do, but that's OK. "Unordered" here
merely means the order is undefined; that it just so happens to be the
same as rasterizer order is immaterial. As for pixel- vs. sample-level
interlock, Vulkan explicitly states:
> With sample shading enabled, [the `PixelInterlockOrderedEXT` and
> `PixelInterlockUnorderedEXT`] execution modes are treated like
> `SampleInterlockOrderedEXT` or `SampleInterlockUnorderedEXT`
> respectively.
and:
> If [the `SampleInterlockOrderedEXT` or `SampleInterlockUnorderedEXT`]
> execution modes are used in single-sample mode they are treated like
> `PixelInterlockOrderedEXT` or `PixelInterlockUnorderedEXT`
> respectively.
So this will DTRT for MoltenVK and gfx-rs, at least.
MSL additionally supports multiple raster order groups; resources that
are not accessed together can be placed in different ROGs to allow them
to be synchronized separately. A more sophisticated analysis might be
able to place resources optimally, but that's outside the scope of this
change. For now, we assign all resources to group 0, which should do for
our purposes.
`glslang` doesn't support the `RasterizerOrdered` UAVs this
implementation produces for HLSL, so the test case needs `fxc.exe`.
It also insists on GLSL 4.50 for `GL_ARB_fragment_shader_interlock`,
even though the spec says it needs either 4.20 or
`GL_ARB_shader_image_load_store`; and it doesn't support the
`GL_NV_fragment_shader_interlock` extension at all. So I haven't been
able to test those code paths.
Fixes#1002.
This command allows the caller to set the base value of
`BuiltInWorkgroupId`, and thus of `BuiltInGlobalInvocationId`. Metal
provides no direct support for this... but it does provide a builtin,
`[[grid_origin]]`, normally used to pass the base values for the stage
input region, which we will now abuse to pass the dispatch base and
avoid burning a buffer binding.
`[[grid_origin]]`, as part of Metal's support for compute stage input,
requires MSL 1.2. For 1.0 and 1.1, we're forced to provide a buffer.
(Curiously, this builtin was undocumented until the MSL 2.2 release. Go
figure.)
This extension provides a new operation which causes a fragment to be
discarded without terminating the fragment shader invocation. The
invocation for the discarded fragment becomes a helper invocation, so
that derivatives will remain defined. The old `HelperInvocation` builtin
becomes undefined when this occurs, so a second new instruction queries
the current helper invocation status.
This is only fully supported for GLSL. HLSL doesn't support the
`IsHelperInvocation` operation and MSL doesn't support the
`DemoteToHelperInvocation` op.
Fixes#1052.
This provides a few functions normally available in OpenCL to the SPIR-V
shader environment. These functions happen to be available in Metal as
well.
No GLSL, unfortunately. Intel has yet to publish a
`GL_INTEL_shader_integer_functions2` spec.
Fix fallout from changes.
There's a bug in glslang that prevents `float16_t`, `[u]int16_t`, and
`[u]int8_t` constants from adding the corresponding SPIR-V capabilities.
SPIRV-Tools, meanwhile, tightened validation so that these constants are
only valid if the corresponding `Float16`, `Int16`, and `Int8` caps are
on. This affects the `16bit-constants.frag` test for GLSL and MSL.
The only piece added by this extension is the `DeviceIndex` builtin,
which tells the shader which device in a grouped logical device it is
running on.
Metal's pipeline state objects are owned by the `MTLDevice` that created
them. Since Metal doesn't support logical grouping of devices the way
Vulkan does, we'll thus have to create a pipeline state for each device
in a grouped logical device. The upcoming peer group support in Metal 3
will not change this. For this reason, for Metal, the device index is
supplied as a constant at pipeline compile time.
There's an interaction between `VK_KHR_device_group` and
`VK_KHR_multiview` in the
`VK_PIPELINE_CREATE_VIEW_INDEX_FROM_DEVICE_INDEX_BIT`, which defines the
view index to be the same as the device index. The new
`view_index_from_device_index` MSL option supports this functionality.
Using the `PostDepthCoverage` mode specifies that the `gl_SampleMaskIn`
variable is to contain the computed coverage mask following the early
fragment tests, which this mode requires and implicitly enables.
Note that unlike Vulkan and OpenGL, Metal places this on the sample mask
input itself, and furthermore does *not* implicitly enable early
fragment testing. If it isn't enabled explicitly with an
`[[early_fragment_tests]]` attribute, the compiler will error out. So we
have to enable that mode explicitly if `PostDepthCoverage` is enabled
but `EarlyFragmentTests` isn't.
For Metal, only iOS supports this; for some reason, Apple has yet to
implement it on macOS, even though many desktop cards support it.
The old method of using a different unpacked matrix type doesn't work
for scalar alignment. It certainly wouldn't have any effect for a square
matrix, since the number of columns and rows are the same. So now we'll
store them as arrays of packed vectors.
Relaxed block layout relaxed the restrictions on vector alignment,
allowing them to be aligned on scalar boundaries. Scalar block layout
relaxes this further, allowing *any* member to be aligned on a scalar
boundary. The requirement that a vector not improperly straddle a
16-byte boundary is also relaxed.
I've also added a test showing that `std430` layout works with UBOs.
I'm troubled by the dual meaning of the `Packed` extended decoration. In
some instances (struct, `float[]`, and `vec2[]` members), it actually
means the exact opposite, that the member needs extra padding. This is
especially problematic for `vec2[]`, because now we need to distinguish
the two cases by checking the array stride. I wonder if this should
actually be split into two decorations.