Commit Graph

1 Commits

Author SHA1 Message Date
Iain Ireland
3fab9d05cf [regexp] Fix and unify non-unicode case-folding algorithms
Non-unicode, case-insensitive regexps (e.g. /foo/i, not foo/iu) use a
case-folding algorithm that doesn't quite match the Unicode
definition. There are two places in irregexp that need to do
case-folding. Prior to this patch, neither of them quite matched the
spec (https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).

This patch implements the "Canonicalize" algorithm in
src/regexp/special-case.h, and uses it in the relevant places. It
replaces special-case logic around upper-casing / ASCII characters
with the following approach:

1. For most characters, calling UnicodeSet::closeOver on a set
   containing that character will produce the correct set of
   case-insensitive matches.

2. For a small handful of characters (like the sharp S that prompted
   this change), UnicodeSet::closeOver will include some characters
   that should be omitted. For example, although closeOver('ß') =
   "ßẞ", uppercase('ß') is "SS", so step 3.e means that 'ß'
   canonicalizes to itself, and should not match 'ẞ'. In these cases,
   we can skip the closeOver entirely, because it will never add an
   equivalent character. These characters are in the IgnoreSet.

3. For an even smaller handful of characters, UnicodeSet::closeOver
   will produce some characters that should be omitted, but also some
   characters that should be included. For example, closeOver('k') =
   "kKK" (lowercase k, uppercase K, U+212A KELVIN SIGN), but KELVIN
   SIGN should not match either of the other two (step 3.g). To handle
   this, we put such characters in the SpecialAddSet. In these cases,
   we closeOver the original character, but filter out the results
   that do not have the same canonical value.

The computation of IgnoreSet and SpecialAddSet happens at build time,
using the pre-existing gen-regexp-special-case.cc step.

R=jgruber@chromium.org

Bug: v8:10248
Change-Id: I00d48b180c83bb8e645cc59eda57b01eab134f0b
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2072858
Reviewed-by: Frank Tang <ftang@chromium.org>
Reviewed-by: Jakob Gruber <jgruber@chromium.org>
Commit-Queue: Jakob Gruber <jgruber@chromium.org>
Cr-Commit-Position: refs/heads/master@{#66641}
2020-03-10 11:09:28 +00:00