When gl_Position was defined by SPIR-V but neither used nor initialized,
it appeared twice in the MSL output, as gl_Position and gl_Position_1.
The existing tests for whether an output is active check only that it is
used by an op, or initialized. Adding the implicit gl_Position also marked
the existing gl_Position as active, duplicating the output variable.
The fix: when checking whether an implicit gl_Position output needs to be
added, also check whether the variable is already defined in the shader
and just needs to be marked as active.
Add test shader.
Vulkan specifies that the Sample Mask Test occurs before fragment shading.
This means gl_SampleMaskIn should be influenced by both sample-shading and
VkPipelineMultisampleStateCreateInfo::pSampleMask.
CTS tests dEQP-VK.pipeline.multisample_shader_builtin.* bear this out.
For sample-shading, gl_SampleMaskIn should have only a single bit set.
Since Metal does not filter for this, apply a bitmask based on gl_SampleID.
For a fixed sample mask, since Metal is unaware of
VkPipelineMultisampleStateCreateInfo::pSampleMask, we need to ensure that
we apply it to both gl_SampleMaskIn and gl_SampleMask. This has the side
effect of a redundant application of pSampleMask if the shader already
includes gl_SampleMaskIn when setting gl_SampleMask, but I don't see an
easy way around this.
Also, simplify the logic for folding the fixed sample mask into gl_SampleMask,
and print the fixed sample mask as a hex value so the bits are easier to read.
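Roughly, the emitted prologue ends up looking like this (a hand-written
sketch, not actual SPIRV-Cross output; 0x2F stands in for the pipeline's
fixed sample mask):

```msl
#include <metal_stdlib>
using namespace metal;

fragment float4 main0(uint gl_SampleMaskIn [[sample_mask]],
                      uint gl_SampleID [[sample_id]])
{
    // Restrict input coverage to the current sample (sample-shading) and to
    // the fixed pSampleMask, which Metal itself never sees. 0x2F is just a
    // stand-in value for the pipeline's fixed mask.
    uint sample_mask_in = gl_SampleMaskIn & (1u << gl_SampleID) & 0x2Fu;
    return float4(float(popcount(sample_mask_in)));
}
```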
We'll need to force a temporary and mark it as precise.
MSL is a little weird here, but we can piggyback on the invariant float math
option to force fma() operations everywhere.
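The difference, as a hand-written sketch (not the exact code SPIRV-Cross
emits):

```msl
#include <metal_stdlib>
using namespace metal;

static float mad_plain(float a, float b, float c)
{
    // Free to be contracted or re-associated by the compiler.
    return a * b + c;
}

static float mad_precise(float a, float b, float c)
{
    // Forced temporary holding an explicit fma(): the compiler can no longer
    // fold this back into a surrounding expression.
    float tmp = fma(a, b, c);
    return tmp;
}
```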
Firstly, never flatten inputs or outputs in multi-patch mode.
The main scenario where we do need to care is Block IO.
In this case, we should only flatten the top-level member, and after
that we use access chains as normal.
Using structs in Input storage class is now possible as well. We don't
need to consider per-location fixups at all here. In Vulkan, IO structs
must match exactly. Only plain vectors can have smaller vector sizes as
a special case.
In Metal, the `[[position]]` input to a fragment shader remains at
fragment center, even at sample rate, like OpenGL and Direct3D. In
Vulkan, however, when the fragment shader runs at sample rate, the
`FragCoord` builtin moves to the sample position in the framebuffer,
instead of the fragment center. To account for this difference, adjust
the `FragCoord`, if present, by the sample position. The -0.5 offset is
because the fragment center is at (0.5, 0.5).
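A sketch of what the adjustment amounts to, assuming Metal's
`get_sample_position()` is how the per-sample offset is fetched (illustrative
only, not verbatim SPIRV-Cross output):

```msl
#include <metal_stdlib>
using namespace metal;

fragment float4 main0(float4 gl_FragCoord [[position]],
                      uint gl_SampleID [[sample_id]])
{
    // [[position]] stays at the pixel center in Metal; move it to the sample
    // position so it matches Vulkan's sample-rate FragCoord. The -0.5 is
    // because the pixel center sits at (0.5, 0.5).
    gl_FragCoord.xy += get_sample_position(gl_SampleID) - 0.5;
    return gl_FragCoord;
}
```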
Also, add an option to force sample-rate shading in a fragment shader.
Since Metal has no explicit control for this, this is done by adding a
dummy `[[sample_id]]` which is otherwise unused, if none is already
present. This is intended to be used from e.g. MoltenVK when a
pipeline's `minSampleShading` value is nonzero.
Instead of checking if any `Input` variables have `Sample`
interpolation, I've elected to check that the `SampleRateShading`
capability is present. Since `SampleId`, `SamplePosition`, and the
`Sample` interpolation decoration require this cap, this should be
equivalent for any valid SPIR-V module. If this isn't acceptable, let me
know.
New in MSL 2.3 is a template that can be used in the place of a scalar
type in a stage-in struct. This template has methods which interpolate
the varying at the given points. Curiously, you can't set interpolation
attributes on such a varying; perspective-correctness is encoded in the
type, while interpolation must be done using one of the methods. This
makes using this somewhat awkward from SPIRV-Cross, requiring us to jump
through a bunch of hoops to make this all work.
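For reference, the MSL 2.3 type looks roughly like this in use (a hand-written
illustration; the `[[user(locn0)]]` naming follows SPIRV-Cross's usual
convention and is not part of the template itself):

```msl
#include <metal_stdlib>
using namespace metal;

struct main0_in
{
    // Perspective-correctness is part of the type; no interpolation
    // attribute is allowed on a pull-model varying.
    interpolant<float4, interpolation::perspective> vColor [[user(locn0)]];
};

fragment float4 main0(main0_in in [[stage_in]], uint gl_SampleID [[sample_id]])
{
    // Interpolation happens by calling a method on the struct member itself,
    // which is why dynamic indexing and passing varyings by value don't work.
    return in.vColor.interpolate_at_sample(gl_SampleID);
}
```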
Using varyings from functions in particular is a pain point, requiring
us to pass the stage-in struct itself around. An alternative is to pass
references to the interpolants; except this will fall over badly with
composite types, which naturally must be flattened. As with
tessellation, dynamic indexing isn't supported with pull-model
interpolation. This is because of the need to reference the original
struct member in order to call one of the pull-model interpolation
methods on it. Also, this is done at the variable level; this means that
if one varying in a struct is used with the pull-model functions, then
the entire struct is emitted as pull-model interpolants.
For some reason, this was not documented in the MSL spec, though there
is a property on `MTLDevice`, `supportsPullModelInterpolation`,
indicating support for this, which *is* documented. This does not appear
to be implemented yet for AMD: it returns `NO` from
`supportsPullModelInterpolation`, and pipelines with shaders using the
templates fail to compile. It *is* implemented for Intel. It's probably
also implemented for Apple GPUs: on Apple Silicon, OpenGL calls down to
Metal, and it wouldn't be possible to use the interpolation functions
without this implemented in Metal.
Based on my testing, where SPIR-V and GLSL have the offset relative to
the pixel center, in Metal it appears to be relative to the pixel's
upper-left corner, as in HLSL. Therefore, I've added an offset of 0.4375,
i.e. one half minus one sixteenth, to all arguments to
`interpolate_at_offset()`.
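In other words, the generated call ends up along these lines (sketch; the
varying name and the 0.25 offset are just examples):

```msl
#include <metal_stdlib>
using namespace metal;

struct main0_in
{
    interpolant<float4, interpolation::perspective> vColor [[user(locn0)]];
};

fragment float4 main0(main0_in in [[stage_in]])
{
    float2 offset = float2(0.25, -0.25);   // offset as the SPIR-V supplies it
    // GLSL/SPIR-V offsets are relative to the pixel center; Metal's appear to
    // be relative to the upper-left corner, hence the 0.5 - 1/16 = 0.4375 shift.
    return in.vColor.interpolate_at_offset(offset + 0.4375);
}
```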
This also fixes a long-standing bug: if a pull-model interpolation
function is used on a varying, make sure that varying is declared. We
were already doing this only for the AMD pull-model function,
`interpolateAtVertexAMD()`; for reasons which are completely beyond me,
we weren't doing this for the base interpolation functions. I also note
that there are no tests for the interpolation functions for GLSL or
HLSL.
I kept the code to replace constant zero arguments, because `Bias` and
`Grad` still have some problems on desktop GPUs.
`Bias` works on AMD GPUs. `Grad` does not. Both work on Intel. Still
needs testing on NV. It will definitely work with Apple GPUs.
`half` cannot be bitcasted to `float`, because the two types are not the
same size. Use an expanding cast instead.
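That is, something along these lines (illustrative only):

```msl
#include <metal_stdlib>
using namespace metal;

static float widen(half h)
{
    // as_type<float>(h) would be rejected: a bitcast requires equal sizes,
    // and half is 16 bits while float is 32.
    return float(h);   // value-preserving expanding cast instead
}
```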
We were already doing this for stores to the tessellation levels; why I
didn't also do this for loads is beyond me.
MSL 2.3 has everything needed to support this extension on all
platforms. The existing `discard_fragment()` function was given demote
semantics, similar to Direct3D, and the `simd_is_helper_thread()`
function was finally added to iOS.
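A minimal sketch of how the two new ops map onto MSL 2.3 (hand-written, not
SPIRV-Cross output):

```msl
#include <metal_stdlib>
using namespace metal;

fragment float4 main0(float4 pos [[position]])
{
    if (pos.x < 0.0)
        discard_fragment();            // demote: keeps running as a helper (MSL 2.3)
    float4 c = float4(dfdx(pos.x));    // derivatives stay defined after the demote
    if (simd_is_helper_thread())       // query the *current* helper status
        c = float4(0.0);
    return c;
}
```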
I've left the old test alone. Should I remove it in favor of these?
These need to use arrayed texture types, or Metal will complain when
binding the resource. The target layer is addressed relative to the
Layer output by the vertex pipeline, or to the ViewIndex if in a
multiview pipeline. Unlike with the s/t coordinates, Vulkan does not
forbid non-zero layer coordinates here, though this cannot be expressed
in Vulkan GLSL.
Supporting 3D textures will require additional work. Part of the problem
is that Metal does not allow texture views to subset a 3D texture, so we
need some way to pass the base depth to the shader.
Some older iOS devices don't support layered rendering. In that case,
don't set `[[render_target_array_index]]`, because the compiler will
reject the shader in that case. The client will then have to unroll the
render pass manually.
Prior to this point, we were treating them as flattened, as they are in
old-style tessellation control shaders, and still are for structs in
new-style shaders. This is not true for outputs; output composites are
not flattened at all. This semantic mismatch broke a Vulkan CTS test.
It should now pass.
In Metal, render pipelines don't have an option to set a sample mask
parameter; the only way to get that functionality is to set the
`[[sample_mask]]` output of the fragment shader to this value directly.
We also need to take care to combine the fixed sample mask with the
one that the shader might possibly output.
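Roughly, the emitted epilogue combines them like so (sketch; 0x3 stands in for
the fixed mask, 0xF for whatever the shader itself computed):

```msl
#include <metal_stdlib>
using namespace metal;

struct main0_out
{
    float4 color [[color(0)]];
    uint gl_SampleMask [[sample_mask]];
};

fragment main0_out main0()
{
    main0_out out = {};
    out.color = float4(1.0);
    out.gl_SampleMask = 0xFu;   // value the shader itself wrote
    out.gl_SampleMask &= 0x3u;  // fixed sample mask folded in at the end
    return out;
}
```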
This should hopefully reduce underutilization of the GPU, especially on
GPUs where the thread execution width is greater than the number of
control points.
This also simplifies initialization by reading the buffer directly
instead of using Metal's vertex-attribute-in-compute support. It turns
out the only way in which shader stages are allowed to differ in their
interfaces is in the number of components per vector; the base type must
be the same. Since we are using the raw buffer instead of attributes, we
can now also emit arrays and matrices directly into the buffer, instead
of flattening them and then unpacking them. Structs are still flattened,
however; this is due to the need to handle vectors with fewer components
than were output, and I think handling this while also directly emitting
structs could get ugly.
Another advantage of this scheme is that the extra invocations that were
needed to read the attributes when there were more input than output points
are no longer necessary. The number of threads per workgroup is now lcm(SIMD-size,
output control points). This should ensure we always process a whole
number of patches per workgroup.
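For example, with a SIMD width of 32 and 3 output control points,
lcm(32, 3) = 96 threads per workgroup, i.e. exactly 32 patches; with 4 output
control points, lcm(32, 4) = 32 threads, i.e. 8 whole patches.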
To avoid complexity handling indices in the tessellation control shader,
I've also changed the way vertex shaders for tessellation are handled.
They are now compute kernels using Metal's support for vertex-style
stage input. This lets us always emit vertices into the buffer in order
of vertex shader execution. Now we no longer have to deal with indexing
in the tessellation control shader. This also fixes a long-standing
issue where if an index were greater than the number of vertices to
draw, the vertex shader would wind up writing outside the buffer, and
the vertex would be lost.
This is a breaking change, and I know SPIRV-Cross has other clients, so
I've hidden this behind an option for now. In the future, I want to
remove this option and make it the default.
Metal is picky about interface matching. If the types don't match
exactly, down to the number of vector components, Metal fails pipeline
compilation. To support pipelines where the number of components
consumed by the fragment shader is less than that produced by the vertex
shader, we have to fix up the fragment shader to accept all the
components produced.
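Concretely, if the vertex shader writes a vec4 at location 0 but the fragment
shader only declares a vec2 there, the fragment input gets widened and the
shader reads a swizzle of it. A hand-written sketch (the `[[user(locn0)]]`
naming follows SPIRV-Cross's usual convention):

```msl
#include <metal_stdlib>
using namespace metal;

struct main0_in
{
    // Declared as vec2 in the fragment SPIR-V, but widened to float4 so the
    // type matches what the vertex shader actually produces at location 0.
    float4 vTexCoord [[user(locn0)]];
};

fragment float4 main0(main0_in in [[stage_in]])
{
    float2 uv = in.vTexCoord.xy;   // only the components the shader asked for
    return float4(uv, 0.0, 1.0);
}
```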
DXVK emits SPIR-V where fragment shader builtins have names derived from
DXBC assembly, e.g. `oDepth` for `FragDepth`. When we declared the
disabled output, we used this name, but when referencing it, we
continued to use the GLSL name. This breaks compilation.
Like with `point_size` when not rendering points, Metal complains when
writing to a variable with the `[[depth]]` qualifier when no depth
buffer is attached. In that case, we must avoid emitting `FragDepth`,
just like with `PointSize`.
I assume it will also complain if there is no stencil attachment and the
shader writes to `[[stencil]]`, or if it writes to `[[color(n)]]` but there
is no color attachment at n.
Here, the inline uniform block is explicit: we instantiate the buffer
block itself in the argument buffer, instead of a pointer to the buffer.
I just hope this will work with the `MTLArgumentDescriptor` API...
Note that Metal recursively assigns IDs to the individual members of
embedded structs. This means that, for automatic assignment, we have to
calculate the binding stride for a given buffer block. For MoltenVK,
we'll simply increment the ID by the size of the inline uniform block.
Then the later IDs will never conflict with the inline uniform block. We
can get away with this because Metal doesn't require that IDs be
contiguous, only monotonically increasing.
MSL does not support this, so we have to emulate it by passing it around
as a varying between stages. We use a special "user(clipN)" attribute
for this rather than locN, which is used for user varyings.
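For instance, a two-element gl_ClipDistance read in the fragment stage might
come through like this (sketch, assuming the user(clipN) convention described
above):

```msl
#include <metal_stdlib>
using namespace metal;

struct main0_in
{
    float gl_ClipDistance_0 [[user(clip0)]];
    float gl_ClipDistance_1 [[user(clip1)]];
};

fragment float4 main0(main0_in in [[stage_in]])
{
    // Reassemble the array the SPIR-V expects from the individual varyings.
    float gl_ClipDistance[2] = { in.gl_ClipDistance_0, in.gl_ClipDistance_1 };
    return float4(gl_ClipDistance[0], gl_ClipDistance[1], 0.0, 1.0);
}
```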
To support loading arrays of arrays properly in tessellation, we need a
rewrite of how tessellation access chains are handled.
The major change is to remove the implicit unflatten step inside
access_chain, which did not take into account the case where you load
directly from a control point array variable.
We defer the unflatten step until OpLoad time instead.
This fixes cases where we load an array of {array, matrix, struct}.
Removes the hacky path for the MSL access chain index workaround.
If there are enough members in an IAB, we cannot use the constant
address space, as the MSL compiler complains about there being too many
members. Support emitting the device address space instead.
Rolled the hashes used for glslang, SPIRV-Tools, and SPIRV-Headers to
HEAD, which includes the update to 1.5.
Added '--amb' to the glslang invocation, so I didn't have to explicitly set
bindings in the large number of test shaders that currently don't set them
and that glslang now considers invalid.
Marked all shaders that no longer pass spirv-val as .invalid.
Vulkan has two types of buffer descriptors,
`VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC` and
`VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC`, which allow the client to
offset the buffers by an amount given when the descriptor set is bound
to a pipeline. Metal provides no direct support for this when the buffer
in question is in an argument buffer, so once again we're on our own.
These offsets cannot be stored or associated in any way with the
argument buffer itself, because they are set at bind time. Different
pipelines may have different offsets set. Therefore, we must use a
separate buffer, not in any argument buffer, to hold these offsets. Then
the shader must manually offset the buffer pointer.
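A rough sketch of what the shader-side fixup might look like, with a
hypothetical side buffer of offsets bound outside the argument buffer (all
names and binding indices here are illustrative, not the exact SPIRV-Cross
output):

```msl
#include <metal_stdlib>
using namespace metal;

struct UBO { float4x4 mvp; };

struct spvDescriptorSet0
{
    constant UBO* ubo [[id(0)]];
};

vertex float4 main0(constant spvDescriptorSet0& set0 [[buffer(0)]],
                    constant uint* spvDynamicOffsets [[buffer(23)]],
                    float4 pos [[attribute(0)]])
{
    // Re-point the descriptor at bind time: add the dynamic offset (in bytes)
    // passed in the side buffer, since the argument buffer itself cannot
    // carry it.
    constant UBO& ubo = *(constant UBO*)((constant char*)set0.ubo + spvDynamicOffsets[0]);
    return ubo.mvp * pos;
}
```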
This change fully supports arrays, including arrays of arrays, even
though Vulkan forbids them. It does not, however, support runtime
arrays. Perhaps later.
Writable textures cannot use argument buffers on iOS. They must be
passed as arguments directly to the shader function. Since we won't know
if a given storage image will have the `NonWritable` decoration at the
time we encode the argument buffer, we must therefore pass all storage
images as discrete arguments. Previously, we were throwing an error if
we encountered an argument buffer with a writable texture in it on iOS.
This was straightforward to implement in GLSL. The
`ShadingRateInterlockOrderedEXT` and `ShadingRateInterlockUnorderedEXT`
modes aren't implemented yet, because we don't support
`SPV_NV_shading_rate` or `SPV_EXT_fragment_invocation_density` yet.
HLSL and MSL were more interesting. They don't support this directly,
but they do support marking resources as "rasterizer ordered," which
does roughly the same thing. So this implementation scans all accesses
inside the critical section and marks all storage resources found
therein as rasterizer ordered. They also don't support the fine-grained
controls on pixel- vs. sample-level interlock and disabling ordering
guarantees that GLSL and SPIR-V do, but that's OK. "Unordered" here
merely means the order is undefined; that it just so happens to be the
same as rasterizer order is immaterial. As for pixel- vs. sample-level
interlock, Vulkan explicitly states:
> With sample shading enabled, [the `PixelInterlockOrderedEXT` and
> `PixelInterlockUnorderedEXT`] execution modes are treated like
> `SampleInterlockOrderedEXT` or `SampleInterlockUnorderedEXT`
> respectively.
and:
> If [the `SampleInterlockOrderedEXT` or `SampleInterlockUnorderedEXT`]
> execution modes are used in single-sample mode they are treated like
> `PixelInterlockOrderedEXT` or `PixelInterlockUnorderedEXT`
> respectively.
So this will DTRT for MoltenVK and gfx-rs, at least.
MSL additionally supports multiple raster order groups; resources that
are not accessed together can be placed in different ROGs to allow them
to be synchronized separately. A more sophisticated analysis might be
able to place resources optimally, but that's outside the scope of this
change. For now, we assign all resources to group 0, which should do for
our purposes.
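In MSL, that just means tagging the relevant resources, e.g. (hand-written
sketch):

```msl
#include <metal_stdlib>
using namespace metal;

fragment void main0(texture2d<float, access::read_write> img
                        [[texture(0), raster_order_group(0)]],
                    float4 pos [[position]])
{
    // Accesses inside the SPIR-V critical section become ordered with respect
    // to other fragments covering the same pixel because the resource sits in
    // a raster order group.
    uint2 coord = uint2(pos.xy);
    float4 v = img.read(coord);
    img.write(v + 1.0, coord);
}
```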
`glslang` doesn't support the `RasterizerOrdered` UAVs this
implementation produces for HLSL, so the test case needs `fxc.exe`.
It also insists on GLSL 4.50 for `GL_ARB_fragment_shader_interlock`,
even though the spec says it needs either 4.20 or
`GL_ARB_shader_image_load_store`; and it doesn't support the
`GL_NV_fragment_shader_interlock` extension at all. So I haven't been
able to test those code paths.
Fixes #1002.
This command allows the caller to set the base value of
`BuiltInWorkgroupId`, and thus of `BuiltInGlobalInvocationId`. Metal
provides no direct support for this... but it does provide a builtin,
`[[grid_origin]]`, normally used to pass the base values for the stage
input region, which we will now abuse to pass the dispatch base and
avoid burning a buffer binding.
`[[grid_origin]]`, as part of Metal's support for compute stage input,
requires MSL 1.2. For 1.0 and 1.1, we're forced to provide a buffer.
(Curiously, this builtin was undocumented until the MSL 2.2 release. Go
figure.)
This extension provides a new operation which causes a fragment to be
discarded without terminating the fragment shader invocation. The
invocation for the discarded fragment becomes a helper invocation, so
that derivatives will remain defined. The old `HelperInvocation` builtin
becomes undefined when this occurs, so a second new instruction queries
the current helper invocation status.
This is only fully supported for GLSL. HLSL doesn't support the
`IsHelperInvocation` operation and MSL doesn't support the
`DemoteToHelperInvocation` op.
Fixes #1052.
This provides a few functions normally available in OpenCL to the SPIR-V
shader environment. These functions happen to be available in Metal as
well.
No GLSL, unfortunately. Intel has yet to publish a
`GL_INTEL_shader_integer_functions2` spec.
Fix fallout from changes.
There's a bug in glslang that prevents `float16_t`, `[u]int16_t`, and
`[u]int8_t` constants from adding the corresponding SPIR-V capabilities.
SPIRV-Tools, meanwhile, tightened validation so that these constants are
only valid if the corresponding `Float16`, `Int16`, and `Int8` caps are
on. This affects the `16bit-constants.frag` test for GLSL and MSL.
The only piece added by this extension is the `DeviceIndex` builtin,
which tells the shader which device in a grouped logical device it is
running on.
Metal's pipeline state objects are owned by the `MTLDevice` that created
them. Since Metal doesn't support logical grouping of devices the way
Vulkan does, we'll thus have to create a pipeline state for each device
in a grouped logical device. The upcoming peer group support in Metal 3
will not change this. For this reason, for Metal, the device index is
supplied as a constant at pipeline compile time.
There's an interaction between `VK_KHR_device_group` and
`VK_KHR_multiview` in the
`VK_PIPELINE_CREATE_VIEW_INDEX_FROM_DEVICE_INDEX_BIT`, which defines the
view index to be the same as the device index. The new
`view_index_from_device_index` MSL option supports this functionality.
Using the `PostDepthCoverage` mode specifies that the `gl_SampleMaskIn`
variable is to contain the computed coverage mask following the early
fragment tests, which this mode requires and implicitly enables.
Note that unlike Vulkan and OpenGL, Metal places this on the sample mask
input itself, and furthermore does *not* implicitly enable early
fragment testing. If it isn't enabled explicitly with an
`[[early_fragment_tests]]` attribute, the compiler will error out. So we
have to enable that mode explicitly if `PostDepthCoverage` is enabled
but `EarlyFragmentTests` isn't.
For Metal, only iOS supports this; for some reason, Apple has yet to
implement it on macOS, even though many desktop cards support it.
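The result is roughly (sketch; attribute spelling per the MSL spec, builtin
naming per SPIRV-Cross convention):

```msl
#include <metal_stdlib>
using namespace metal;

// PostDepthCoverage requires early fragment tests, so the attribute is added
// even when the SPIR-V doesn't declare EarlyFragmentTests.
[[early_fragment_tests]]
fragment float4 main0(uint gl_SampleMaskIn [[sample_mask, post_depth_coverage]])
{
    // Coverage here already reflects the early depth/stencil results.
    return float4(float(popcount(gl_SampleMaskIn)));
}
```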
The old method of using a different unpacked matrix type doesn't work
for scalar alignment. It certainly wouldn't have any effect for a square
matrix, since the number of columns and rows are the same. So now we'll
store them as arrays of packed vectors.
Relaxed block layout loosened the restrictions on vector alignment,
allowing them to be aligned on scalar boundaries. Scalar block layout
relaxes this further, allowing *any* member to be aligned on a scalar
boundary. The requirement that a vector not improperly straddle a
16-byte boundary is also relaxed.
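For example, a column-major mat4 at a scalar-aligned offset might now come
out along these lines (hand-written sketch):

```msl
#include <metal_stdlib>
using namespace metal;

struct UBO
{
    float pad;              // leaves the matrix at a merely scalar-aligned offset
    packed_float4 mvp[4];   // stored as an array of packed vectors
};

static inline float4x4 load_mvp(constant UBO& ubo)
{
    // Reassemble the matrix from its packed columns before use.
    return float4x4(float4(ubo.mvp[0]), float4(ubo.mvp[1]),
                    float4(ubo.mvp[2]), float4(ubo.mvp[3]));
}
```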
I've also added a test showing that `std430` layout works with UBOs.
I'm troubled by the dual meaning of the `Packed` extended decoration. In
some instances (struct, `float[]`, and `vec2[]` members), it actually
means the exact opposite, that the member needs extra padding. This is
especially problematic for `vec2[]`, because now we need to distinguish
the two cases by checking the array stride. I wonder if this should
actually be split into two decorations.
There is a case where we can deduce a for/while loop, but the continue
block is actually very painful to deal with, so handle that case as
well. Removes an exceptional case.
MSL prior to 2.2 doesn't support these natively in any stage but
compute. But, we can (assuming no threads were terminated prematurely)
get their values with some creative uses of the
`simd_prefix_exclusive_sum()` and `simd_sum()` functions.
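The trick, roughly (a sketch, valid only while no lane in the simd-group has
terminated early):

```msl
#include <metal_stdlib>
using namespace metal;

fragment float4 main0()
{
    uint gl_SubgroupInvocationID = simd_prefix_exclusive_sum(1u); // lanes before me
    uint gl_SubgroupSize = simd_sum(1u);                          // active lane count
    return float4(float(gl_SubgroupInvocationID) / float(gl_SubgroupSize));
}
```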
Also, fix a missing `to_expression()` with `BuiltInSubgroupEqMask`.
For KhronosGroup/MoltenVK#629.
This is needed to support `VK_KHR_multiview`, which is in turn needed
for Vulkan 1.1 support. Unfortunately, Metal provides no native support
for this, and Apple is once again less than forthcoming, so we have to
implement it all ourselves.
Tessellation and geometry shaders are deliberately unsupported for now.
The problem is that the current implementation encodes the `ViewIndex`
as part of the `InstanceIndex`, which in the SPIR-V environment at least
only exists in the vertex shader. So we need to work out a way to pass
the view index along to the later stages.
This implementation runs vertex shaders for all views up to the highest
bit set in the view mask, even those whose bits are clear. The fragments
for the inactive views are then discarded. Avoiding this is difficult:
calculating the view indices becomes far more complicated if we can only
run for those views which are set in the mask.