In local-access-chain-convert, we replace loads by load the entire
variable, then doing the extract. The extract will have the same value
as the load. However, if the load has a decoration on it, the
decoration is lost because we do not copy any them to the new id.
This is fixed by rewritting the load into the extract and keeping the
same result id.
This change has the effect that we do not call DCEInst on the loads
because the load is not being deleted, but replaced. This could leave
OpAccessChain instructions around that are not used. This is not a
problem for -O and -Os. They run local_single_*_elim passes and then
dead code elimination. The dce will remove the unused access chains,
and the load elimination passes work even if there are unused access
chains. I have added test to them to ensure they will not loss
opportunities.
Fixes#1787.
The code patterns generated by DXC around function calls can cause many
store to be storing the same value that was just loaded from the same
location:
```
%10 = OpLoad %type %var
OpStore %var %10
```
We want to clean these up very early on because they can cause other
transformations to do a lot of work. For the cases I see, they can be
removed during local-single-block-elim.
For one set of shaders the compile time goes from 248s to 182s. A 26%
improvement.
Part of https://github.com/KhronosGroup/SPIRV-Tools/issues/1494.
Eliminate unused store to variable if followed by store to same
variable in same block.
Most significantly, this cleans up stores made unused by this pass.
These useless stores can inhibit subsequent optimizations, specifically
LocalSingleStoreElim. Eliminating them makes subsequent optimization more
effective.
The main effect of this pass is to simplify the work done by the SSA
rewriter. It catches many local loads/stores that help speeding up the
work done by the main rewriter.
The local-single-store-elim algorithm is not fundamentally bad.
However, when there are a large number of variables, some of the
maps that are used can become very large. These large data structures
then take a very long time to be destroyed. I've seen cases around 40%
if the time.
I've rewritten that algorithm to not use as much memory. This give a
significant improvement when running a large number of shader through
DXC.
I've also made a small change to local-single-block-elim to delete the
loads that is has replaced. That way local-single-store-elim will not
have to look at those. local-single-store-elim now does the same thing.
The time for one set goes from 309s down to 126s. For another set, the
time goes from 102s down to 88s.
The algorithm used in DCEInst to remove dead code is very slow. It is
fine if you only want to remove a small number of instructions, but, if
you need to remove a large number of instructions, then the algorithm in
ADCE is much faster.
This PR removes the calls to DCEInst in the load-store removal passes
and adds a pass of ADCE afterwards.
A number of different iterations of the order of optimization, and I
believe this is the best I could find.
The results I have on 3 sets of shaders are:
Legalization:
Set 1: 5.39 -> 5.01
Set 2: 13.98 -> 8.38
Set 3: 98.00 -> 96.26
Performance passes:
Set 1: 6.90 -> 5.23
Set 2: 10.11 -> 6.62
Set 3: 253.69 -> 253.74
Size reduction passes:
Set 1: 7.16 -> 7.25
Set 2: 17.17 -> 16.81
Set 3: 112.06 -> 107.71
Note that the third set's compile time is large because of the large
number of basic blocks, not so much because of the number of
instructions. That is why we don't see much gain there.
A few optimizations are updates to handle code that is suppose to be
using the logical addressing mode, but still has variables that contain
pointers as long as the pointer are to opaque objects. This is called
"relaxed logical addressing".
|Instruction::GetBaseAddress| will check that pointers that are use meet
the relaxed logical addressing rules. Optimization that now handle
relaxed logical addressing instead of logical addressing are:
- aggressive dead-code elimination
- local access chain convert
- local store elimination passes.
Re-formatted the source tree with the command:
$ /usr/bin/clang-format -style=file -i \
$(find include source tools test utils -name '*.cpp' -or -name '*.h')
This required a fix to source/val/decoration.h. It was not including
spirv.h, which broke builds when the #include headers were re-ordered by
clang-format.
Create a new optimization pass, strength reduction, which will replace
integer multiplication by a constant power of 2 with an equivalent bit
shift. More changes could be added later.
- Does not duplicate constants
- Adds vector |Concat| utility function to a common test header.
Includes code to deal correctly with OpFunctionParameter. This
is needed by opaque propagation which may not exhaustively inline
entry point functions.
Adds ProcessEntryPointCallTree: a method to do work on the
functions in the entry point call trees in a deterministic order.