v8/test/mjsunit/regexp-experimental.js

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

99 lines
3.4 KiB
JavaScript
Raw Normal View History

// Copyright 2020 the V8 project authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
// Flags: --allow-natives-syntax --enable-experimental-regexp-engine
function Test(regexp, subject, expectedResult, expectedLastIndex) {
[regexp] Fix usage of {Is,Mark}PcProcessed in NfaInterpreter Previously we checked whether a thread's pc IsPcProcessed before pushing to the stack of (postponed) active_threads_. This commit moves the IsPcProcessed check and corresponding MarkPcProcessed call to when the thread is actually processed, i.e. when it is popped from the active_threads_ stack again. This fixes two issues: - Consider what used to happen in the following scenario: 1. An active thread t is postponed (e.g. because it is a fork) and pushed on active_threads_. IsPcProcessed(t.pc) is false, so t is not discarded and does actually end up on active_threads_. 2. Some other thread s is executed, and at some point s.pc == t.pc, i.e. t.pc is marked as processed. 3. t is popped from active_threads_ for processing. In 3 we don't want to continue execution of t: After all, its pc is already marked as processed. But because previously we only checked for IsPcProcessed in step 1 before pushing to active_threads_, we used to continue execution in 3. I don't think this is a correctness issue, but possibly a performance problem. In any case, this commit moves the IsPcProcessed check from 1 to 3 and so fixes this. - After flushing blocked_threads_, we push them to active_threads_ again. While doing so, we used to mark these thread's pcs as processed. This meant that sometimes a (fork of a) high priority thread was cancelled by the IsPcProcessed check even though its pc was only marked as processed by a thread with lower priority during flushing. We need it to be the other way round: The low priority thread should be cancelled after its pc is processed by a thread with higher priority. With this commit we don't MarkPcProcessed during flushing, it's postponed to when we're actually processing. This was a correctness issue, and there's a new corresponding test case. Bug: v8:10765 Change-Id: Ie12682cf3f8a04222d907edd8a3ad25baa69465a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2388112 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#69668}
2020-09-02 08:28:34 +00:00
assertEquals(%RegexpTypeTag(regexp), "EXPERIMENTAL");
var result = regexp.exec(subject);
[regexp] Support assertions in experimental engine Assertions are implemented with the new ASSERTION instruction. The nfa interpreter evaluates the assertion based on the current context in the subject string every time a thread executes ASSERTION. This is analogous to what re2 and rust/regex do. Alternatives to this approach: - The interpreter could calculate eagerly for all assertion types whether they are satisfied whenever the current input position is advanced. This would make evaluating the ASSERTION instruction itself cheaper, but at the cost of making every advance in the input string more expensive. I suspect this would be slower on average because assertions are not that common that we typically evaluate >= 2 assertions at every input position. - Assertions in a regexp could be desugared into CONSUME_RANGE instructions, so that no new instruction would be necessary. For example, the word boundary assertion \b is satisfied at a given position/state if we have just consumed a word character and will consume a non-word character next, or vice-versa. The tricky part about this is that the assertion itself should not consume input, so we'd have to split (automaton) states according to whether we've arrived at them via a word character or not. The current compiler is not really equipped for this kind of transformation. For {start,end} of {line,file} assertions, we'd need to introduce dummy characters indicating start/end of input (say, 0x10000 and 0x10001) which we feed to the interpreter before respectively after the actual input. I suspect that this approach wouldn't make much of a difference for NFA execution. It would likely speed up (lazy) DFA execution though because assertions would be dealt with in the fast path. Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng Bug: v8:10765 Change-Id: Ic2012c943e0ce54eb8662789fb3d4c1b6cd8d606 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2398644 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Leszek Swirski <leszeks@chromium.org> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70026}
2020-09-17 16:24:35 +00:00
if (result instanceof Array && expectedResult instanceof Array) {
assertArrayEquals(expectedResult, result);
} else {
assertEquals(expectedResult, result);
}
[regexp] Fix usage of {Is,Mark}PcProcessed in NfaInterpreter Previously we checked whether a thread's pc IsPcProcessed before pushing to the stack of (postponed) active_threads_. This commit moves the IsPcProcessed check and corresponding MarkPcProcessed call to when the thread is actually processed, i.e. when it is popped from the active_threads_ stack again. This fixes two issues: - Consider what used to happen in the following scenario: 1. An active thread t is postponed (e.g. because it is a fork) and pushed on active_threads_. IsPcProcessed(t.pc) is false, so t is not discarded and does actually end up on active_threads_. 2. Some other thread s is executed, and at some point s.pc == t.pc, i.e. t.pc is marked as processed. 3. t is popped from active_threads_ for processing. In 3 we don't want to continue execution of t: After all, its pc is already marked as processed. But because previously we only checked for IsPcProcessed in step 1 before pushing to active_threads_, we used to continue execution in 3. I don't think this is a correctness issue, but possibly a performance problem. In any case, this commit moves the IsPcProcessed check from 1 to 3 and so fixes this. - After flushing blocked_threads_, we push them to active_threads_ again. While doing so, we used to mark these thread's pcs as processed. This meant that sometimes a (fork of a) high priority thread was cancelled by the IsPcProcessed check even though its pc was only marked as processed by a thread with lower priority during flushing. We need it to be the other way round: The low priority thread should be cancelled after its pc is processed by a thread with higher priority. With this commit we don't MarkPcProcessed during flushing, it's postponed to when we're actually processing. This was a correctness issue, and there's a new corresponding test case. Bug: v8:10765 Change-Id: Ie12682cf3f8a04222d907edd8a3ad25baa69465a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2388112 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#69668}
2020-09-02 08:28:34 +00:00
assertEquals(expectedLastIndex, regexp.lastIndex);
}
// The empty regexp.
Test(new RegExp(""), "asdf", [""], 0);
// Plain patterns without special operators.
Test(/asdf1/, "123asdf1xyz", ["asdf1"], 0);
// Escaped operators, otherwise plain string:
Test(/\*\.\(\[\]\?/, "123*.([]?123", ["*.([]?"], 0);
// Some two byte values:
Test(/쁰d섊/, "123쁰d섊abc", ["쁰d섊"], 0);
// A pattern with surrogates but without unicode flag:
Test(/💩f/, "123💩f", ["💩f"], 0);
// Disjunctions.
Test(/asdf|123/, "xyz123asdf", ["123"], 0);
Test(/asdf|123|fj|f|a/, "da123", ["a"], 0);
Test(/|123/, "123", [""], 0);
// Character ranges.
Test(/[abc]/, "123asdf", ["a"], 0);
Test(/[0-9]/, "asdf123xyz", ["1"], 0);
Test(/[^0-9]/, "123!xyz", ["!"], 0);
Test(/\w\d/, "?a??a3!!!", ["a3"], 0);
// [💩] without unicode flag is a character range matching one of the two
// surrogate characters that make up 💩. The leading surrogate is 0xD83D.
Test(/[💩]/, "f💩", [String.fromCodePoint(0xD83D)], 0);
// Greedy and non-greedy quantifiers.
Test(/x*/, "asdfxk", [""], 0);
[regexp] Fix usage of {Is,Mark}PcProcessed in NfaInterpreter Previously we checked whether a thread's pc IsPcProcessed before pushing to the stack of (postponed) active_threads_. This commit moves the IsPcProcessed check and corresponding MarkPcProcessed call to when the thread is actually processed, i.e. when it is popped from the active_threads_ stack again. This fixes two issues: - Consider what used to happen in the following scenario: 1. An active thread t is postponed (e.g. because it is a fork) and pushed on active_threads_. IsPcProcessed(t.pc) is false, so t is not discarded and does actually end up on active_threads_. 2. Some other thread s is executed, and at some point s.pc == t.pc, i.e. t.pc is marked as processed. 3. t is popped from active_threads_ for processing. In 3 we don't want to continue execution of t: After all, its pc is already marked as processed. But because previously we only checked for IsPcProcessed in step 1 before pushing to active_threads_, we used to continue execution in 3. I don't think this is a correctness issue, but possibly a performance problem. In any case, this commit moves the IsPcProcessed check from 1 to 3 and so fixes this. - After flushing blocked_threads_, we push them to active_threads_ again. While doing so, we used to mark these thread's pcs as processed. This meant that sometimes a (fork of a) high priority thread was cancelled by the IsPcProcessed check even though its pc was only marked as processed by a thread with lower priority during flushing. We need it to be the other way round: The low priority thread should be cancelled after its pc is processed by a thread with higher priority. With this commit we don't MarkPcProcessed during flushing, it's postponed to when we're actually processing. This was a correctness issue, and there's a new corresponding test case. Bug: v8:10765 Change-Id: Ie12682cf3f8a04222d907edd8a3ad25baa69465a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2388112 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#69668}
2020-09-02 08:28:34 +00:00
Test(/xx*a/, "xxa", ["xxa"], 0);
Test(/x*[xa]/, "xxaa", ["xxa"], 0);
Test(/x*?[xa]/, "xxaa", ["x"], 0);
Test(/x*?a/, "xxaa", ["xxa"], 0);
Test(/x+a/, "axxa", ["xxa"], 0);
Test(/x+?[ax]/, "axxa", ["xx"], 0);
Test(/xx?[xa]/, "xxaa", ["xxa"], 0);
Test(/xx??[xa]/, "xxaa", ["xx"], 0);
Test(/xx??a/, "xxaa", ["xxa"], 0);
Test(/x{4}/, "xxxxxxxxx", ["xxxx"], 0);
Test(/x{4,}/, "xxxxxxxxx", ["xxxxxxxxx"], 0);
Test(/x{4,}?/, "xxxxxxxxx", ["xxxx"], 0);
Test(/x{2,4}/, "xxxxxxxxx", ["xxxx"], 0);
Test(/x{2,4}?/, "xxxxxxxxx", ["xx"], 0);
// Non-capturing groups and nested operators.
Test(/(?:)/, "asdf", [""], 0);
Test(/(?:asdf)/, "123asdfxyz", ["asdf"], 0);
Test(/(?:asdf)|123/, "xyz123asdf", ["123"], 0);
Test(/asdf(?:[0-9]|(?:xy|x)*)*/, "kkkasdf5xyx8xyyky", ["asdf5xyx8xy"], 0);
[regexp] Support capture groups in experimental engine This commit adds support for capture groups (as in e.g. /x(123|abc)y/) in the experimental regexp engine. Now every InterpreterThread owns a register array containing (sub)match boundaries. There is a new instruction to record the current input index in some register. Submatches in quantifier bodies should be reported only if they occur during the last repetition. Thus we reset those registers before attempting to match the body of a quantifier. This is implemented with another new instruction. Because of concerns for the growing sizeof the NfaInterpreter object (which is allocated on the stack), this commit replaces the `SmallVector` members of the NfaInterpreter with zone-allocated arrays. Register arrays, which for a fixed regexp are all the same size, are allocated with a RecyclingZoneAllocator for cheap memory reclamation via a linked list of equally-sized free blocks. Possible optimizations for management of register array memory: 1. If there are few register per thread, then it is likely faster to store them inline in the InterpreterThread struct. 2. re2 implements copy-on-write: InterpreterThreads can share the same register array. If a thread attempts to write to shared register array, the register array is cloned first. 3. The register at index 1 contains the end of the match; this is only written to right before an ACCEPT statement. We could make ACCEPT equivalent to what's currently CAPTURE 1 followed by ACCEPT. We could then save the memory for register 1 for threads that haven't finished yet. This is particularly interesting if now optimization 1 kicks in. Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng Bug: v8:10765 Change-Id: I2c0503206ce331e13ac9912945bb66736d740197 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2390770 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#69929}
2020-09-15 19:03:51 +00:00
// Capturing groups.
Test(/()/, "asdf", ["", ""], 0);
Test(/(123)/, "asdf123xyz", ["123", "123"], 0);
Test(/asdf(123)xyz/, "asdf123xyz", ["asdf123xyz", "123"], 0);
Test(/(123|xyz)/, "123", ["123", "123"], 0);
Test(/(123|xyz)/, "xyz", ["xyz", "xyz"], 0);
Test(/(123)|(xyz)/, "123", ["123", "123", undefined], 0);
Test(/(123)|(xyz)/, "xyz", ["xyz", undefined, "xyz"], 0);
Test(/(?:(123)|(xyz))*/, "xyz123", ["xyz123", "123", undefined], 0);
Test(/((123)|(xyz)*)*/, "xyz123xyz", ["xyz123xyz", "xyz", undefined, "xyz"], 0);
[regexp] Support assertions in experimental engine Assertions are implemented with the new ASSERTION instruction. The nfa interpreter evaluates the assertion based on the current context in the subject string every time a thread executes ASSERTION. This is analogous to what re2 and rust/regex do. Alternatives to this approach: - The interpreter could calculate eagerly for all assertion types whether they are satisfied whenever the current input position is advanced. This would make evaluating the ASSERTION instruction itself cheaper, but at the cost of making every advance in the input string more expensive. I suspect this would be slower on average because assertions are not that common that we typically evaluate >= 2 assertions at every input position. - Assertions in a regexp could be desugared into CONSUME_RANGE instructions, so that no new instruction would be necessary. For example, the word boundary assertion \b is satisfied at a given position/state if we have just consumed a word character and will consume a non-word character next, or vice-versa. The tricky part about this is that the assertion itself should not consume input, so we'd have to split (automaton) states according to whether we've arrived at them via a word character or not. The current compiler is not really equipped for this kind of transformation. For {start,end} of {line,file} assertions, we'd need to introduce dummy characters indicating start/end of input (say, 0x10000 and 0x10001) which we feed to the interpreter before respectively after the actual input. I suspect that this approach wouldn't make much of a difference for NFA execution. It would likely speed up (lazy) DFA execution though because assertions would be dealt with in the fast path. Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng Bug: v8:10765 Change-Id: Ic2012c943e0ce54eb8662789fb3d4c1b6cd8d606 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2398644 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Leszek Swirski <leszeks@chromium.org> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70026}
2020-09-17 16:24:35 +00:00
// Assertions.
Test(/asdf\b/, "asdf---", ["asdf"], 0);
Test(/asdf\b/, "asdfg", null, 0);
Test(/asd[fg]\B/, "asdf asdgg", ["asdg"], 0);
Test(/^asd[fg]/, "asdf asdgg", ["asdf"], 0);
[regexp] Support assertions in experimental engine Assertions are implemented with the new ASSERTION instruction. The nfa interpreter evaluates the assertion based on the current context in the subject string every time a thread executes ASSERTION. This is analogous to what re2 and rust/regex do. Alternatives to this approach: - The interpreter could calculate eagerly for all assertion types whether they are satisfied whenever the current input position is advanced. This would make evaluating the ASSERTION instruction itself cheaper, but at the cost of making every advance in the input string more expensive. I suspect this would be slower on average because assertions are not that common that we typically evaluate >= 2 assertions at every input position. - Assertions in a regexp could be desugared into CONSUME_RANGE instructions, so that no new instruction would be necessary. For example, the word boundary assertion \b is satisfied at a given position/state if we have just consumed a word character and will consume a non-word character next, or vice-versa. The tricky part about this is that the assertion itself should not consume input, so we'd have to split (automaton) states according to whether we've arrived at them via a word character or not. The current compiler is not really equipped for this kind of transformation. For {start,end} of {line,file} assertions, we'd need to introduce dummy characters indicating start/end of input (say, 0x10000 and 0x10001) which we feed to the interpreter before respectively after the actual input. I suspect that this approach wouldn't make much of a difference for NFA execution. It would likely speed up (lazy) DFA execution though because assertions would be dealt with in the fast path. Cq-Include-Trybots: luci.v8.try:v8_linux64_fyi_rel_ng Bug: v8:10765 Change-Id: Ic2012c943e0ce54eb8662789fb3d4c1b6cd8d606 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2398644 Commit-Queue: Martin Bidlingmaier <mbid@google.com> Reviewed-by: Leszek Swirski <leszeks@chromium.org> Reviewed-by: Jakob Gruber <jgruber@chromium.org> Cr-Commit-Position: refs/heads/master@{#70026}
2020-09-17 16:24:35 +00:00
Test(/asd[fg]$/, "asdf asdg", ["asdg"], 0);
// The global flag.
Test(/asdf/g, "fjasdfkkasdf", ["asdf"], 6);
// The sticky flag.
var r = /asdf/y;
r.lastIndex = 2;
Test(r, "fjasdfkkasdf", ["asdf"], 6);
// The multiline flag.
Test(/^a/m, "x\na", ["a"], 0);
Test(/x$/m, "x\na", ["x"], 0);
// The dotall flag.
Test(/asdf.xyz/s, "asdf\nxyz", ["asdf\nxyz"], 0);