[regexp] Fix wrong match of lone surrogates

A lead surrogate and a trail surrogate separated by an "always
succeeding" backreference (one whose group is still undefined because it
has not captured anything yet) were incorrectly combined into a surrogate
pair, so the pattern wrongly matched the combined code point.
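
For illustration only, a minimal sketch of the behavior described above,
assuming a plain JavaScript shell such as d8 or Node. The variable names
are hypothetical and not part of this change. The last result depends on
the engine: it should be false with this fix, and builds without the fix
may report true.

```javascript
// Hypothetical names; a sketch of the semantics, not the regression test.

// One code point, U+1040F, stored as two UTF-16 code units.
const input = '\ud801\udc0f';

// Inside group 1, \1 refers to a group that has not finished capturing.
// Its value is undefined, so the backreference matches the empty string.
const backrefMatchesEmpty = /(a\1b)/.test('ab');

// Without the u flag, matching works on code units, so the two escapes
// match the two halves of the pair and the empty backreference is a no-op.
const splitPairNonUnicode = /(\ud801\1\udc0f)/.test(input);

// With the u flag, matching works on code points. A lone lead surrogate in
// the pattern must not match the first half of a well-formed pair.
const loneLeadUnicode = /\ud801/u.test(input);

// The case from this commit: the two surrogates are separate atoms in the
// pattern, so they must not be merged into U+1040F across the backreference.
// Expected to be false on an engine that includes the fix.
const splitPairUnicode = /(\ud801\1\udc0f)/u.test(input);

console.log({
  backrefMatchesEmpty,
  splitPairNonUnicode,
  loneLeadUnicode,
  splitPairUnicode,
});
```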

Bug: v8:13410
Change-Id: I2faf9ca5f9fcfd55cd6933a1ea038c88f8d3f524
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/4013159
Commit-Queue: Patrick Thier <pthier@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/main@{#84276}
pthier 2022-11-08 13:29:13 +01:00 committed by V8 LUCI CQ
parent 9174d25829
commit dd92fe999b
2 changed files with 19 additions and 1 deletion

@@ -2627,7 +2627,10 @@ void RegExpBuilder::AddEscapedUnicodeCharacter(base::uc32 character) {
   FlushPendingSurrogate();
 }
-void RegExpBuilder::AddEmpty() { pending_empty_ = true; }
+void RegExpBuilder::AddEmpty() {
+  FlushPendingSurrogate();
+  pending_empty_ = true;
+}
 void RegExpBuilder::AddClassRanges(RegExpClassRanges* cc) {
   if (NeedsDesugaringForUnicode(cc)) {

@@ -0,0 +1,15 @@
+// Copyright 2022 the V8 project authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+// Make sure lone surrogates don't match combined surrogates.
+assertFalse(/[\ud800-\udfff]+/u.test('\ud801\udc0f'));
+// Surrogate pairs split by an "always succeeding" backref shouldn't match
+// combined surrogates.
+assertFalse(/(\ud801\1\udc0f)/u.test('\ud801\udc0f'));
+assertFalse(/(\ud801\1?\udc0f)/u.test('\ud801\udc0f'));
+assertFalse(/(\ud801\1{0}\udc0f)/u.test('\ud801\udc0f'));
+assertFalse(new RegExp('(\ud801\\1\udc0f)','u').test('\ud801\udc0f'));
+assertFalse(new RegExp('(\ud801\\1?\udc0f)','u').test('\ud801\udc0f'));
+assertFalse(new RegExp('(\ud801\\1{0}\udc0f)','u').test('\ud801\udc0f'));