6da6e45099
Optimize FastDtoa, in particular Grisu3.

In addition to making a microbenchmark, there are a number of smaller and larger changes here:

- Replace divisions by power-of-ten with multiplications by their inverses, using an algorithm very similar to the one in libdivide.
- For DiyFp::Times(), use 128-bit hardware multiplication if available (which it generally is on 64-bit platforms).
- Where possible, send around a pointer to the end of the string, instead of a pointer and a length, reducing register pressure (especially for Intel). Where not (easily) possible, add a local variable to make the compiler understand that length and decimal_point cannot alias.
- Change some ints to unsigneds where it helps us avoid sign extensions.
- Some minor changes to reduce instruction dependency chains.
- Inline BiggestPowerTen().

Actual performance gain is wildly different between platforms. On my 3990X workstation (Zen 2), gains are about 21%. On a M1 Mac Mini, they are about 17%. But on my i7-10610U laptop (Comet Lake, so Skylake microarchitecture), the function is 78% faster. This is probably because large divisions (divisor over 255) seem to hurt a lot on Skylake, but I haven't gone through it in detail.

Change-Id: I5b67c257d788a3f7d1be7065d055456852451d68
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/4110741
Commit-Queue: Steinar H Gunderson <sesse@chromium.org>
Reviewed-by: Michael Lippautz <mlippautz@chromium.org>
Cr-Commit-Position: refs/heads/main@{#84906}

||
---|---|---|
cpp | ||
csuite | ||
benchmarks.status | ||
BUILD.gn | ||
testcfg.py |
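
The commit message above mentions replacing divisions by a power of ten with multiplications by their inverses. As a rough illustration of that general idea only (this is not the code from the change, and the helper names `DivideBy10` and `SplitLastDigit` are hypothetical), a division of a 32-bit value by 10 can be sketched as a widening multiply followed by a shift:

```cpp
#include <cstdint>

// Hypothetical helper, not taken from V8: divides a 32-bit value by 10
// without a division instruction. 0xCCCCCCCD is ceil(2^35 / 10), so the
// 64-bit product shifted right by 35 bits is the exact quotient for every
// uint32_t input.
inline uint32_t DivideBy10(uint32_t n) {
  const uint64_t kInverse = 0xCCCCCCCDull;  // ceil(2^35 / 10)
  return static_cast<uint32_t>((static_cast<uint64_t>(n) * kInverse) >> 35);
}

// Hypothetical helper: stores n / 10 in *quotient and returns n % 10,
// using one multiply-shift plus a multiply-subtract instead of two divisions.
inline uint32_t SplitLastDigit(uint32_t n, uint32_t* quotient) {
  *quotient = DivideBy10(n);
  return n - *quotient * 10u;
}
```

For example, `SplitLastDigit(12345, &q)` returns 5 and leaves 1234 in `q`. Compilers often perform this transformation themselves when the divisor is a compile-time constant; writing it out by hand is mainly of interest when the divisor is selected at run time from a small table of powers of ten, which is an assumption about the situation in digit-generation loops rather than something the listing states.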