This is a separation of the DFA Unicode Decoder from
https://chromium-review.googlesource.com/c/v8/v8/+/789560
I attempted to make the DFA's table a bit more explicit in this CL. Still, the
linter prevents me from letting me present the array as a "table" in source
code. For a better representation, please refer to
https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA
- - - - -
Now for a big copy-paste from 789560:
Essentially, reworks a standard FSM (imagine an
array of structs) and flattens it out into a single-dimension array.
Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
map all bytes into one of 12 character classes:
00. 0x00-0x7F
01. 0x80-0x8F (split from general continuation because this range is not
valid after a 0xF0 leading byte)
02. 0x90-0x9F (split from general continuation because this range is not
valid after a 0xE0 nor a 0xF4 leading byte)
03. 0xA0-0xBF (the rest of the continuation range)
04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
includes 255 which we use as a known bad byte during hex-to-int
decoding)
05. 0xC2-0xDF (leading bytes which require any continuation byte
afterwards)
06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
continuation byte after that)
07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which requires any continuation
afterwards then any continuation byte after that)
08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
continuation byte after that)
09. 0xF1-F3 (leading bytes which requires any continuation byte
afterwards then any continuation byte then any continuation byte)
10. 0xF0 (leading bytes which requires a 0x90-0xBF afterwards then any
continuation byte then any continuation byte)
11. 0xF4 (leading bytes which requires a 0x80-0x8F afterwards then any
continuation byte then any continuation byte)
Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
needed to represent the transition state ("9, 10, 10, 10" vs.
"10, 9, 9, 9").
Using these 12 classes as "transitions", we can map from one state to
the next. Each state is defined as some multiple of 12, so that we're
always starting at the 0th column of each row of the FSM. From each
state, we add the transition and get a index of the new row the FSM is
entering.
If at any point we encounter a bad byte, the state + bad-byte-transition
is guaranteed to map us into the first row of the FSM (which contains no
valid exiting transitions).
The key differences from Björn's original (or his self-modified) DFA is
the "bad" state is now mapped to 0 (or the first row of the FSM) instead
of 12 (the second row). This saves ~50 bytes when gzipping, and also
speeds up determining if a string is properly encoded (see his sample
code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).
Finally, I've replace his ternary check with an array access, to make
the algorithm branchless. This places a requirement on the caller to 0
out the code point between successful decodings, which it could always
have done because it's already branching.
R=marja@google.com
Bug:
Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
Reviewed-on: https://chromium-review.googlesource.com/805357
Commit-Queue: Justin Ridgewell <jridgewell@google.com>
Reviewed-by: Marja Hölttä <marja@chromium.org>
Reviewed-by: Mathias Bynens <mathias@chromium.org>
Cr-Commit-Position: refs/heads/master@{#50012}
This patch normalizes the casing of hexadecimal digits in escape
sequences of the form `\xNN` and integer literals of the form
`0xNNNN`.
Previously, the V8 code base used an inconsistent mixture of uppercase
and lowercase.
Google’s C++ style guide uses uppercase in its examples:
https://google.github.io/styleguide/cppguide.html#Non-ASCII_Characters
Moreover, uppercase letters more clearly stand out from the lowercase
`x` (or `u`) characters at the start, as well as lowercase letters
elsewhere in strings.
BUG=v8:7109
TBR=marja@chromium.org,titzer@chromium.org,mtrofin@chromium.org,mstarzinger@chromium.org,rossberg@chromium.org,yangguo@chromium.org,mlippautz@chromium.org
NOPRESUBMIT=true
Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.linux:linux_chromium_rel_ng
Change-Id: I790e21c25d96ad5d95c8229724eb45d2aa9e22d6
Reviewed-on: https://chromium-review.googlesource.com/804294
Commit-Queue: Mathias Bynens <mathias@chromium.org>
Reviewed-by: Jakob Kummerow <jkummerow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#49810}
Verify that both UTF-8 decoders (incremental and non-incremental one) match the
expectations.
Also cleanup / harden the UTF-8 handling code, as suggested in
https://chromium-review.googlesource.com/c/v8/v8/+/671020/ .
BUG=chromium:765608
Change-Id: I6344d62ca15b75ac8e333421c94c4aa35ab8190d
Reviewed-on: https://chromium-review.googlesource.com/681217
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48229}
Follow up to https://chromium-review.googlesource.com/671020
We still didn't return the correct amount of invalid characters, according to
the Encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder ), when we
saw a byte sequence which was as start of an overlong / invalid sequence, but
there weren't enough continuation bytes.
A more rigorous test will follow in
https://chromium-review.googlesource.com/c/v8/v8/+/681217
BUG=chromium:765608
Change-Id: I535670edc14d3bae144e5a9ca373f12eec78a934
Reviewed-on: https://chromium-review.googlesource.com/681674
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org>
Reviewed-by: Yang Guo <yangguo@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48165}
This fix is two-fold:
1) Incremental UTF-8 decoding: Unify incorrect UTF-8 handling between V8 and
Blink.
Incremental UTF-8 decoding used to allow some overlong sequences / invalid code
points which Blink treated as errors. This caused the decoder and the Blink
UTF-8 decoder to produce a different number of bytes, resulting in random
failures when scripts were streamed (especially, this was detected by the
skipping inner functions feature which adds CHECKs against expected function
positions).
2) Non-incremental UTF-8 decoding: return the correct amount of invalid characters.
According to the encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder
), the first byte of an overlong sequence / invalid code point generates an
invalid character, and the rest of the bytes are not processed (i.e., pushed
back to the byte stream). When they're handled, they will look like lonely
continuation bytes, and will generate an invalid character each.
As a result, an overlong 4-byte sequence should generate 4 invalid characters
(not 1).
This is a potentially breaking change, since the (non-incremental) UTF-8
decoding is exposed via the API (String::NewFromUtf8). The behavioral difference
happens when the client is passing in invalid UTF-8 (containing overlong /
surrogate sequences).
However, afaict, this doesn't change the semantics of any JavaScript program:
according to the ECMAScript spec, the program is a sequence of Unicode code
points, and there's no way to invoke the UTF-8 decoding functionalities from
inside JavaScript. Though, this changes the behavior of d8 when decoding source
files which are invalid UTF-8.
This doesn't change anything related to URI decoding (it already throws
exceptions for overlong sequences / invalid code points).
BUG: chromium:765608, chromium:758236, v8:5516
Bug:
Change-Id: Ib029f6a8e87186794b092e4e8af32d01cee3ada0
Reviewed-on: https://chromium-review.googlesource.com/671020
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Franziska Hinkelmann <franzih@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48105}
For such a simple predicate, calling a(n inline) function that checks against
the values is faster (*) than maintaining the cache.
(*) When scanning a file that contains only comments, we're basically calling
IsLineTerminator in a loop. Parsing such files is now 7-18% faster in local
experiments.
BUG=v8:6092
Change-Id: I6a8f2aba9669a76152292f4e6c7853638d15aae3
Reviewed-on: https://chromium-review.googlesource.com/645633
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Adam Klein <adamk@chromium.org>
Cr-Commit-Position: refs/heads/master@{#47810}
icu-case-mapping was shipped a few months ago. By dropping
the flag, unibrow's case conversion code won't be included
by default because V8_INTL_SUPPORT is on by default.
BUG=v8:4477, v8:4476
TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*,
mjsunit/string-case, intl/general/case*
Cq-Include-Trybots: master.tryserver.v8:v8_linux_noi18n_rel_ng
Change-Id: I78be9cc64b4588bc5af79ecbbadf93af6e84a1df
Reviewed-on: https://chromium-review.googlesource.com/534541
Commit-Queue: Jungshik Shin <jshin@chromium.org>
Reviewed-by: Benedikt Meurer <bmeurer@chromium.org>
Reviewed-by: Daniel Ehrenberg <littledan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#46304}
Use ICU to check ID_Start, ID_Continue and WhiteSpace even for BMP
when V8_INTL_SUPPORT is on (which is default).
Change LineTerminator::Is() to check 4 code points from
ES#sec-line-terminators instead of using tables and Lookup function.
Remove Lowercase::Is(). It's not used anywhere.
Update webkit/{ToNumber,parseFloat}.js to have the correct expectation
for U+180E and the corresponding expected files. This is a follow-up to
an earlier change ( https://codereview.chromium.org/2720953003 ).
CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_win_dbg,v8_mac_dbg;master.tryserver.chromium.android:android_arm64_dbg_recipe
CQ_INCLUDE_TRYBOTS=master.tryserver.v8:v8_linux_noi18n_rel_ng
BUG=v8:5370,v8:5155
TEST=unittests --gtest_filter=CharP*
TEST=webkit: ToNumber, parseFloat
TEST=test262: built-ins/Number/S9.3*, built-ins/parse{Int,Float}/S15*
TEST=test262: language/white-space/mong*
TEST=test262: built-ins/String/prototype/trim/u180e
TEST=mjsunit: whitespaces
Review-Url: https://codereview.chromium.org/2331303002
Cr-Commit-Position: refs/heads/master@{#45957}
The validation of utf8 strings in WebAssembly modules used the character
kBadChar = 0xFFFD to indicate a validation error. However, this
character can appear in a valid utf8 string. This CL fixes this problem
by duplicating some of the code in {Utf8::CalculateValue} and inlining
it directly into Utf8::Validate. Note that Utf8::Validate is used only
for WebAssembly.
Tests for this change are in the WebAssembly spec tests, which I will
update in a separate CL.
R=vogelheim@chromium.org
Change-Id: I8697b9299f3e98a8eafdf193bff8bdff90efd7dc
Reviewed-on: https://chromium-review.googlesource.com/509534
Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org>
Commit-Queue: Andreas Haas <ahaas@chromium.org>
Cr-Commit-Position: refs/heads/master@{#45476}
Unibrow is currently at Unicode version 7.0.0, which does not
include mongolian vowel separator (\u180E) as white space. In
order to appease test262 at the time however we kept it as a
whitespace.
Test262 has since then been updated. And while this is not an
update of unibrow, we are removing \u180E as white space here.
R=jshin@chromium.org, littledan@chromium.org
BUG=v8:5155
Review-Url: https://codereview.chromium.org/2720953003
Cr-Commit-Position: refs/heads/master@{#43485}
This code should not access bytes out of the permitted range in order to check
the range of a possible UTF-8 value. Instead, the length check should occur
before such checks.
BUG=chromium:667260, chromium:662822
Review-Url: https://codereview.chromium.org/2520053003
Cr-Commit-Position: refs/heads/master@{#41165}
This brings the two utf-8 decoders (bulk + incremental) in line.
Technically, either behaviour was correct, since the utf-8 spec
demands incomplete utf-8 be handled, but does not specify how.
Unicode recommends that "the maximal subpart at that offset
should be replaced by a single U+FFFD," and with this change we
consistently do that. More details + spec references in the bug.
BUG=chromium:662822
Review-Url: https://codereview.chromium.org/2493143003
Cr-Commit-Position: refs/heads/master@{#41025}
- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.
Observe:
- Performance:
- All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
body of only a few instructions. I expect most calls to end up there.
- There used to be performance problems w/ bookmarking, particularly
with copying too much data on SetBookmark w/ UTF-8 streaming streams.
All those copies are gone.
- The old streaming streams implementation used to copy data even for
2-byte input. It no longer does.
- The only remaining 'slow' method is the Seek(.) slow case for utf-8
streaming streams. I don't expect this to be called a lot; and even if,
I expect it to be offset by the gains in the (vastly more frequent)
calls to the other methods or the 'fast path'.
- If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
- I want to remove the 'old' API in a follow-up CL, which should mostly
delete code, or replace it 1:1.
- In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
for streaming streams.
- The "bookmark" is now always implemented (and mostly very fast), so we
should be able to use it for more things.
- Testing & correctness:
- The unit tests now cover all stream implementations,
and are pretty good and triggering all the edge cases.
- Vastly more DCHECKs of the invariants.
BUG=v8:4947
Review-Url: https://codereview.chromium.org/2314663002
Cr-Commit-Position: refs/heads/master@{#39464}
Names passed for imports and exports are checked during decoding,
leading to errors if they are no valid UTF-8. Function names are not
checked during decode, but rather lead to undefined being returned at
runtime if they are not UTF-8.
We need to do these checks on the Wasm side, since the factory
methods assume to get valid UTF-8 strings.
R=titzer@chromium.org, yangguo@chromium.org
Review-Url: https://codereview.chromium.org/1967023004
Cr-Commit-Position: refs/heads/master@{#36208}
size_t is the correct data type for this purpose. Our APIs (in particular
ExternalSourceStream::GetMoreData) are already using it, and there were some
static_casts to convert between them.
This CL doesn't intend to fix all of V8, just the minimal sense-making part
around scanner character streams.
BUG=
Review URL: https://codereview.chromium.org/864273005
Cr-Commit-Position: refs/heads/master@{#26449}
Added Nl category to letters predicate (as requried for JS identifiers).
Changed/simplified representation of canonicalization ranges.
Truncated tables to code points in the BMP (all that is used by JS).
Reformatted tables to avoid excessively long lines.
Removed duplicate entries from multi-character mapping result tables.
Review URL: http://codereview.chromium.org/3030026
git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@5155 ce2b1a6d-e550-0410-aec6-3dcde31c8c00