Commit Graph

26 Commits

Author SHA1 Message Date
Justin Ridgewell
cedec225c9 Implement DFA Unicode Decoder
This is a separation of the DFA Unicode Decoder from
https://chromium-review.googlesource.com/c/v8/v8/+/789560

I attempted to make the DFA's table a bit more explicit in this CL. Still, the
linter prevents me from letting me present the array as a "table" in source
code. For a better representation, please refer to
https://docs.google.com/spreadsheets/d/1L9STtkmWs-A7HdK5ZmZ-wPZ_VBjQ3-Jj_xN9c6_hLKA

- - - - -

Now for a big copy-paste from 789560:

Essentially, reworks a standard FSM (imagine an
array of structs) and flattens it out into a single-dimension array.
Using Table 3-7 of the Unicode 10.0.0 standard (page 126 of
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf), we can nicely
map all bytes into one of 12 character classes:

00. 0x00-0x7F
01. 0x80-0x8F (split from general continuation because this range is not
    valid after a 0xF0 leading byte)
02. 0x90-0x9F (split from general continuation because this range is not
    valid after a 0xE0 nor a 0xF4 leading byte)
03. 0xA0-0xBF (the rest of the continuation range)
04. 0xC0-0xC1, 0xF5-0xFF (the joined range of invalid bytes, notice this
    includes 255 which we use as a known bad byte during hex-to-int
        decoding)
05. 0xC2-0xDF (leading bytes which require any continuation byte
    afterwards)
06. 0xE0 (leading byte which requires a 0xA0-0xBF afterwards then any
    continuation byte after that)
07. 0xE1-0xEC, 0xEE-0xEF (leading bytes which requires any continuation
    afterwards then any continuation byte after that)
08. 0xED (leading byte which requires a 0x80-0x9F afterwards then any
    continuation byte after that)
09. 0xF1-F3 (leading bytes which requires any continuation byte
    afterwards then any continuation byte then any continuation byte)
10. 0xF0 (leading bytes which requires a 0x90-0xBF afterwards then any
    continuation byte then any continuation byte)
11. 0xF4 (leading bytes which requires a 0x80-0x8F afterwards then any
    continuation byte then any continuation byte)

Note that 0xF0 and 0xF1-0xF3 were swapped so that fewer bytes were
needed to represent the transition state ("9, 10, 10, 10" vs.
"10, 9, 9, 9").

Using these 12 classes as "transitions", we can map from one state to
the next. Each state is defined as some multiple of 12, so that we're
always starting at the 0th column of each row of the FSM. From each
state, we add the transition and get a index of the new row the FSM is
entering.

If at any point we encounter a bad byte, the state + bad-byte-transition
is guaranteed to map us into the first row of the FSM (which contains no
valid exiting transitions).

The key differences from Björn's original (or his self-modified) DFA is
the "bad" state is now mapped to 0 (or the first row of the FSM) instead
of 12 (the second row). This saves ~50 bytes when gzipping, and also
speeds up determining if a string is properly encoded (see his sample
code at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance).

Finally, I've replace his ternary check with an array access, to make
the algorithm branchless. This places a requirement on the caller to 0
out the code point between successful decodings, which it could always
have done because it's already branching.

R=marja@google.com

Bug: 
Change-Id: I574f208a84dc5d06caba17127b0d41f7ce1a3395
Reviewed-on: https://chromium-review.googlesource.com/805357
Commit-Queue: Justin Ridgewell <jridgewell@google.com>
Reviewed-by: Marja Hölttä <marja@chromium.org>
Reviewed-by: Mathias Bynens <mathias@chromium.org>
Cr-Commit-Position: refs/heads/master@{#50012}
2017-12-11 21:36:13 +00:00
clemensh
f0523e3046 [wasm] Add UTF-8 validation
Names passed for imports and exports are checked during decoding,
leading to errors if they are no valid UTF-8. Function names are not
checked during decode, but rather lead to undefined being returned at
runtime if they are not UTF-8.

We need to do these checks on the Wasm side, since the factory
methods assume to get valid UTF-8 strings.

R=titzer@chromium.org, yangguo@chromium.org

Review-Url: https://codereview.chromium.org/1967023004
Cr-Commit-Position: refs/heads/master@{#36208}
2016-05-12 13:02:14 +00:00
marja
0e3b5386ae Scanner / Unicode decoding: use size_t instead of unsigned.
size_t is the correct data type for this purpose. Our APIs (in particular
ExternalSourceStream::GetMoreData) are already using it, and there were some
static_casts to convert between them.

This CL doesn't intend to fix all of V8, just the minimal sense-making part
around scanner character streams.

BUG=

Review URL: https://codereview.chromium.org/864273005

Cr-Commit-Position: refs/heads/master@{#26449}
2015-02-05 07:54:34 +00:00
jkummerow@chromium.org
f96e386d9a Replace C++ bitfields with our own BitFields
Shave this yak from orbit, it's the only way to be sure.

BUG=chromium:427616
LOG=n
R=svenpanne@chromium.org

Review URL: https://codereview.chromium.org/700963002

Cr-Commit-Position: refs/heads/master@{#25148}
git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@25148 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-11-05 12:41:46 +00:00
yangguo@chromium.org
8659e50723 Update unicode to 7.0.0.
And do not use code points with PATTERN_* property for identifier start.
Maintain that \u180E is a white space character.

BUG=v8:2892
LOG=Y
R=dpino@igalia.com, mathias@qiwi.be

Review URL: https://codereview.chromium.org/638643002

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@24473 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-10-08 14:55:03 +00:00
bmeurer@chromium.org
d07a2eb806 Rename ASSERT* to DCHECK*.
This way we don't clash with the ASSERT* macros
defined by GoogleTest, and we are one step closer
to being able to replace our homegrown base/ with
base/ from Chrome.

R=jochen@chromium.org, svenpanne@chromium.org

Review URL: https://codereview.chromium.org/430503007

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@22812 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-08-04 11:34:54 +00:00
jochen@chromium.org
a4506cd3f2 Move platform abstraction to base library
Also split v8-core independent methods from checks.h to base/logging.h and
merge v8checks with the rest of checks.

The CPU::FlushICache method is moved to CpuFeatures::FlushICache

RoundUp and related methods are moved to base/macros.h

Remove all layering violations from src/libplatform

BUG=none
R=jkummerow@chromium.org
LOG=n

Review URL: https://codereview.chromium.org/358363002

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@22092 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-06-30 13:25:46 +00:00
jochen@chromium.org
56a486c322 Use full include paths everywhere
- this avoids using relative include paths which are forbidden by the style guide
- makes the code more readable since it's clear which header is meant
- allows for starting to use checkdeps

BUG=none
R=jkummerow@chromium.org, danno@chromium.org
LOG=n

Review URL: https://codereview.chromium.org/304153016

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@21625 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-06-03 08:12:43 +00:00
jochen@chromium.org
84e078e561 Reland 21502 - "Move OS::MemCopy and OS::MemMove out of platform to utils"
Verified that arm builds locally.

BUG=none
TBR=jkummerow@chromium.org
LOG=n

Review URL: https://codereview.chromium.org/306473004

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@21512 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-05-27 07:57:22 +00:00
jochen@chromium.org
eabd5a19b9 Revert 21502 - "Move OS::MemCopy and OS::MemMove out of platform to utils"
TBR=jkummerow@chromium.org

Review URL: https://codereview.chromium.org/297303004

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@21504 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-05-26 19:56:27 +00:00
jochen@chromium.org
a5a21a0da4 Move OS::MemCopy and OS::MemMove out of platform to utils
Since both are jitted on some platforms and depend on codegen, they
don't belong to the platform abstraction. At the same time, I can't put
them to codegen.h, as this would introduce cyclic dependencies.

BUG=none
R=jkummerow@chromium.org
LOG=n

Review URL: https://codereview.chromium.org/302563004

git-svn-id: https://v8.googlecode.com/svn/branches/bleeding_edge@21502 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-05-26 19:33:15 +00:00
bmeurer@chromium.org
d4b533d41b Bulk update of Google copyright headers in source files.
R=svenpanne@chromium.org

Review URL: https://codereview.chromium.org/259183002

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@21035 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-04-29 06:42:26 +00:00
dcarney@chromium.org
f93f8ded96 String:WriteUtf8: Add REPLACE_INVALID_UTF8 option
This patch makes String::WriteUtf8 replace invalid code points (i.e. unmatched
surrogates) with the unicode replacement character when REPLACE_INVALID_UTF8 is
set.  This is done to avoid creating invalid UTF-8 output which can lead to
compatibility issues with software requiring valid UTF-8 inputs (e.g. the
WebSocket protocol requires valid UTF-8 and terminates connections when invalid
UTF-8 is encountered).

R=dcarney@chromium.org

BUG=

Review URL: https://codereview.chromium.org/121173009

Patch from Felix Geisendörfer <haimuiba@gmail.com>.

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@18683 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2014-01-20 09:52:54 +00:00
jkummerow@chromium.org
586c4e74b6 Replace OS::MemCopy with OS::MemMove (just as fast but more flexible).
Review URL: https://codereview.chromium.org/13932006

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@14280 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2013-04-16 12:30:51 +00:00
dcarney@chromium.org
5e1d926053 Some Utf8Length microoptimizations
R=yangguo@chromium.org
BUG=

Review URL: https://codereview.chromium.org/12783002

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@13938 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2013-03-13 19:43:45 +00:00
yangguo@chromium.org
0c822b21cb Fix some latin-1 webkit units tests
R=yangguo@chromium.org
BUG=

Review URL: https://chromiumcodereview.appspot.com/11962035
Patch from Dan Carney <dcarney@google.com>.

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@13455 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2013-01-21 16:11:31 +00:00
yangguo@chromium.org
a8d59243b9 Cleanup latin-1 conversion check in regexp engine
R=yangguo@chromium.org
BUG=

Review URL: https://chromiumcodereview.appspot.com/11880045
Patch from Dan Carney <dcarney@google.com>.

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@13400 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2013-01-16 13:04:07 +00:00
yangguo@chromium.org
04ccb975f4 Remove InputBuffer
R=yangguo@chromium.org
BUG=

Review URL: https://chromiumcodereview.appspot.com/11727004
Patch from Dan Carney <dcarney@google.com>.

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@13298 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2013-01-03 09:18:01 +00:00
yangguo@chromium.org
eedcaf1866 Remove Utf8InputBuffer
R=yangguo@chromium.org
BUG=

Review URL: https://chromiumcodereview.appspot.com/11649018
Patch from Dan Carney <dcarney@google.com>.

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@13248 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2012-12-20 09:20:37 +00:00
yangguo@chromium.org
7cbca775ee Reland regexp global optimizations.
BUG=

Review URL: https://chromiumcodereview.appspot.com/10872010

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@12396 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2012-08-28 09:37:41 +00:00
erik.corry@gmail.com
03cfc4363b Fix input and output to handle UTF16 surrogate pairs.
Review URL: https://chromiumcodereview.appspot.com/9600009

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@11007 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2012-03-12 12:35:28 +00:00
lrn@chromium.org
1d24f5f56b Updated unicode library.
Added Nl category to letters predicate (as requried for JS identifiers).
Changed/simplified representation of canonicalization ranges.
Truncated tables to code points in the BMP (all that is used by JS).
Reformatted tables to avoid excessively long lines.
Removed duplicate entries from multi-character mapping result tables.

Review URL: http://codereview.chromium.org/3030026

git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@5155 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2010-07-30 07:10:22 +00:00
deanm@chromium.org
eb906555fc Cleanup include guards:
- Fix some typos / guards that didn't match the filename.
- Fix some style inconsistencies.
- Add guards to files that were missing them.
- Add the directory name to the guard.

Review URL: http://codereview.chromium.org/99343


git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@1845 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2009-05-04 13:36:43 +00:00
christian.plesner.hansen@gmail.com
9bed566bdb Changed copyright header from google inc. to v8 project authors.
Added presubmit step to check copyright.



git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@242 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2008-09-09 20:08:45 +00:00
christian.plesner.hansen@gmail.com
3351499cb5 Fixed problem where asian characters were not categorized as letters
because they were defined using different syntax in the unicode
database.



git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@200 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2008-09-08 10:45:01 +00:00
christian.plesner.hansen
43d26ecc35 Initial export.
git-svn-id: http://v8.googlecode.com/svn/trunk@2 ce2b1a6d-e550-0410-aec6-3dcde31c8c00
2008-07-03 15:10:15 +00:00