Go to file
Justin Ridgewell cedec225c9 Implement DFA Unicode Decoder
This is a separation of the DFA Unicode Decoder from
https://chromium-review.googlesource.com/c/v8/v8/+/789560

I attempted to make the DFA's table a bit more explicit in this CL. Still, the
linter prevents me from letting me present the array as a "table" in source
code. For a better representation, please refer to
https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA

- - - - -

Now for a big copy-paste from 789560:

Essentially, reworks a standard FSM (imagine an
array of structs) and flattens it out into a single-dimension array.
Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
map all bytes into one of 12 character classes:

00. 0x00-0x7F
01. 0x80-0x8F (split from general continuation because this range is not
    valid after a 0xF0 leading byte)
02. 0x90-0x9F (split from general continuation because this range is not
    valid after a 0xE0 nor a 0xF4 leading byte)
03. 0xA0-0xBF (the rest of the continuation range)
04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
    includes 255 which we use as a known bad byte during hex-to-int
        decoding)
05. 0xC2-0xDF (leading bytes which require any continuation byte
    afterwards)
06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
    continuation byte after that)
07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which requires any continuation
    afterwards then any continuation byte after that)
08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
    continuation byte after that)
09. 0xF1-F3 (leading bytes which requires any continuation byte
    afterwards then any continuation byte then any continuation byte)
10. 0xF0 (leading bytes which requires a 0x90-0xBF afterwards then any
    continuation byte then any continuation byte)
11. 0xF4 (leading bytes which requires a 0x80-0x8F afterwards then any
    continuation byte then any continuation byte)

Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
needed to represent the transition state ("9, 10, 10, 10" vs.
"10, 9, 9, 9").

Using these 12 classes as "transitions", we can map from one state to
the next. Each state is defined as some multiple of 12, so that we're
always starting at the 0th column of each row of the FSM. From each
state, we add the transition and get a index of the new row the FSM is
entering.

If at any point we encounter a bad byte, the state + bad-byte-transition
is guaranteed to map us into the first row of the FSM (which contains no
valid exiting transitions).

The key differences from Björn's original (or his self-modified) DFA is
the "bad" state is now mapped to 0 (or the first row of the FSM) instead
of 12 (the second row). This saves ~50 bytes when gzipping, and also
speeds up determining if a string is properly encoded (see his sample
code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).

Finally, I've replace his ternary check with an array access, to make
the algorithm branchless. This places a requirement on the caller to 0
out the code point between successful decodings, which it could always
have done because it's already branching.

R=marja@google.com

Bug: 
Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
Reviewed-on: https://chromium-review.googlesource.com/805357
Commit-Queue: Justin Ridgewell <jridgewell@google.com>
Reviewed-by: Marja Hölttä <marja@chromium.org>
Reviewed-by: Mathias Bynens <mathias@chromium.org>
Cr-Commit-Position: refs/heads/master@{#50012}
2017-12-11 21:36:13 +00:00
benchmarks Fix common misspellings 2017-08-02 09:35:28 +00:00
build_overrides Remove build_override for macOS min SDK and deployment target. 2017-07-14 18:24:44 +00:00
docs [Docs] Removed unused docs because they are moved to GitHub 2015-11-19 10:23:30 +00:00
gni [gn] Add v8_monolith static library target. 2017-11-27 10:07:51 +00:00
gypfiles [build] Add generic num fuzzer isolates 2017-11-10 14:26:48 +00:00
include [heap] Provide the number of native and detached context via Heap API. 2017-12-11 18:14:31 +00:00
infra Reland "[tools] Migrate v8_linux64_asan_rel_ng and v8_linux64_asan_rel_ng_triggered to LUCI" 2017-12-08 18:50:00 +00:00
samples [cleanup] use unique_ptr for the DefaultPlatform 2017-11-14 09:57:18 +00:00
src Implement DFA Unicode Decoder 2017-12-11 21:36:13 +00:00
test Implement DFA Unicode Decoder 2017-12-11 21:36:13 +00:00
testing Use GTEST_LANG_CXX11 in V8 2016-07-25 13:06:42 +00:00
third_party Update third_party/binutils version to match Chromium. 2017-12-08 07:28:26 +00:00
tools [test] Move GoogleTestSuite to a separate testcfg. 2017-12-11 20:57:07 +00:00
.clang-format [clang-format] Don't derive pointer alignment 2017-01-17 09:28:19 +00:00
.editorconfig Add .editorconfig 2017-07-28 13:39:24 +00:00
.git-blame-ignore-revs [build] Add DEPS formatting to hyper-blame 2017-10-09 14:08:45 +00:00
.gitignore [d8] Fix --enable-tracing 2017-11-30 10:53:37 +00:00
.gn For building v8 using gn on aix_ppc64, linux_s390x and linux_ppc64(both LE and BE). 2017-05-12 01:35:09 +00:00
.vpython Add a comment for the VPython dep 2017-10-25 12:23:35 +00:00
.ycm_extra_conf.py [ycm] Switch from gnu++11 to gnu++14 2017-11-27 07:48:21 +00:00
AUTHORS [build] Do not fail build if source dir has 'debug' in it. 2017-12-10 22:26:34 +00:00
BUILD.gn Implement DFA Unicode Decoder 2017-12-11 21:36:13 +00:00
ChangeLog [Release] Update ChangeLog file 2016-02-17 21:29:09 +00:00
CODE_OF_CONDUCT.md Explicitly state that the Chromium Code of Conduct also applies to V8 2016-03-02 09:51:24 +00:00
codereview.settings Make Gerrit the default code review for V8 2017-06-30 17:37:37 +00:00
DEPS test262 roll 2017-12-11 20:41:40 +00:00
LICENSE Updates to the license information of third party components. 2015-06-16 12:37:50 +00:00
LICENSE.fdlibm Add LICENSE.fdlibm for all the fdlibm imported sources. 2016-06-09 07:17:03 +00:00
LICENSE.strongtalk Add LICENSE.v8, LICENSE.strongtalk and LICENSE.valgrind to the v8 2011-02-03 07:10:06 +00:00
LICENSE.v8 Add LICENSE.v8, LICENSE.strongtalk and LICENSE.valgrind to the v8 2011-02-03 07:10:06 +00:00
LICENSE.valgrind Add LICENSE.v8, LICENSE.strongtalk and LICENSE.valgrind to the v8 2011-02-03 07:10:06 +00:00
Makefile [test] Continued test support for make targets 2017-10-20 15:01:51 +00:00
Makefile.android [build] Fix d8-for-Android builds 2016-05-20 12:10:45 +00:00
OWNERS Add myself to OWNERS to be able to approve/make infra-related changes 2017-11-09 22:52:19 +00:00
PRESUBMIT.py Add vpython verification presubmit 2017-11-03 10:39:56 +00:00
README.md [Docs] Removed unused docs because they are moved to GitHub 2015-11-19 10:23:30 +00:00
snapshot_toolchain.gni MIPS[64]: Generate snapshots on MIPS big-endian targets 2017-11-24 19:00:32 +00:00
WATCHLISTS Add adamk to a few WATCHLISTS 2017-11-10 21:38:29 +00:00

V8 JavaScript Engine

V8 is Google's open source JavaScript engine.

V8 implements ECMAScript as specified in ECMA-262.

V8 is written in C++ and is used in Google Chrome, the open source browser from Google.

V8 can run standalone, or can be embedded into any C++ application.

V8 Project page: https://github.com/v8/v8/wiki

Getting the Code

Checkout depot tools, and run

    fetch v8

This will checkout V8 into the directory v8 and fetch all of its dependencies. To stay up to date, run

    git pull origin
    gclient sync

For fetching all branches, add the following into your remote configuration in .git/config:

    fetch = +refs/branch-heads/*:refs/remotes/branch-heads/*
    fetch = +refs/tags/*:refs/tags/*

Contributing

Please follow the instructions mentioned on the V8 wiki.