Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.
The old implementation had fallen behind the Unicode UAX-29 specification for
grapheme clusters, and could not be easily updated.
The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
Also note that this change adds a new dependency on Break Iteration. Regex
patterns that include \X for matching grapheme clusters, and that previously
worked with ICU builds configured with no break iteration, will now fail.
Patterns that do not use \X are not affected.
In regular expressions, word boundaries tested with \b were incorrect when two
conditions held together: the pattern was in Unicode mode, meaning that an ICU
word break iterator is used to find the boundaries, and the text being matched
was UTF-8 encoded.
The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
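The index mismatch can be illustrated outside ICU. The following is a minimal Python sketch, not the ICU code. The helper names are hypothetical, and naive whitespace splitting stands in for a real word break iterator. It only shows that the UTF-8 and UTF-16 offsets of the same boundary diverge once the text contains non-ASCII characters, so mixing the two index spaces moves boundaries to the wrong place.

```python
def utf8_offset(text: str, char_index: int) -> int:
    """Number of UTF-8 bytes that precede the given code point index."""
    return len(text[:char_index].encode("utf-8"))


def utf16_offset(text: str, char_index: int) -> int:
    """Number of UTF-16 code units that precede the given code point index."""
    return len(text[:char_index].encode("utf-16-le")) // 2


def word_start_offsets(text: str):
    """Hypothetical helper: (utf8, utf16) offsets of each word start.

    Whitespace splitting is an assumption made for illustration only;
    it is not how an ICU word break iterator finds boundaries.
    """
    offsets = []
    in_word = False
    for i, ch in enumerate(text):
        if not ch.isspace() and not in_word:
            offsets.append((utf8_offset(text, i), utf16_offset(text, i)))
        in_word = not ch.isspace()
    return offsets


# "naive" with i-diaeresis, "cafe" with e-acute, an emoji, then plain ASCII.
sample = "na\u00efve caf\u00e9 \U0001F600 ok"
starts = word_start_offsets(sample)
for utf8_index, utf16_index in starts:
    print(utf8_index, utf16_index)
```

The two columns agree only for the first word. After the first non-ASCII character, each word start has a different offset in each encoding.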
- equivalent but originally unequal
- locale distance shifted left for additional fraction bits with micro distance
- Java: more verbose matcher debug output
See #949
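Shifting a distance left to make room for fraction bits is ordinary fixed-point packing. The sketch below is an assumption-laden illustration, not the ICU code. The number of fraction bits (3 here), the function names, and the meaning of the micro distance are all hypothetical, since the message states none of them.

```python
FRACTION_BITS = 3  # assumed width; the commit message does not give one
FRACTION_MASK = (1 << FRACTION_BITS) - 1


def pack_distance(distance: int, micro: int = 0) -> int:
    """Store the whole distance in the high bits and the micro distance below it."""
    if not 0 <= micro <= FRACTION_MASK:
        raise ValueError("micro distance does not fit in the fraction bits")
    return (distance << FRACTION_BITS) | micro


def whole_distance(packed: int) -> int:
    """Recover the whole distance by discarding the fraction bits."""
    return packed >> FRACTION_BITS


def micro_distance(packed: int) -> int:
    """Recover the micro distance from the low bits."""
    return packed & FRACTION_MASK


packed = pack_distance(5, 2)
print(packed, whole_distance(packed), micro_distance(packed))
```

With this layout a plain integer comparison orders values by whole distance first. The micro distance only breaks ties between equal whole distances.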
- Merge the look-ahead result slots used when multiple rules share a common accepting state.
- Sequentially number the look-ahead result slots. Will eventually allow replacing the runtime map with an array.
- Inhibit chaining out of look-ahead rules. This could never actually happen; when a hard break
rule matches, the engine is stopped immediately, but the state table was being constructed
as if it could happen. Reduces table size for line break rules.
- Remove incorrect handling of fAccepting and fLookAhead fields of a state table row
when removing duplicate states. Look-ahead slot number was being mis-interpreted as a state number.
See PR #915, see changes.txt
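The first two items can be pictured with a small sketch. This is a hypothetical Python illustration, not the rule builder's code. It assumes rules are given as a mapping from a rule name to its accepting state number, and that slots are numbered from 0. Both are assumptions made only for the example.

```python
def assign_lookahead_slots(rule_accepting_state):
    """Give each look-ahead rule a dense slot number.

    Rules that share an accepting state share one slot, and slots are
    handed out sequentially in first-seen order.
    """
    slot_for_state = {}
    slot_for_rule = {}
    for rule, state in rule_accepting_state.items():
        if state not in slot_for_state:
            slot_for_state[state] = len(slot_for_state)
        slot_for_rule[rule] = slot_for_state[state]
    return slot_for_rule, len(slot_for_state)


# Hypothetical input: rule_a and rule_c end in the same accepting state.
rules = {"rule_a": 17, "rule_b": 42, "rule_c": 17}
slots, slot_count = assign_lookahead_slots(rules)

# Because slot numbers are dense, a fixed-size list indexed by slot
# can hold the saved look-ahead positions instead of a map.
results = [-1] * slot_count
results[slots["rule_c"]] = 12
print(slots, results)
```

Dense numbering is what makes the array possible. With sparse identifiers the runtime would need a map or an array sized to the largest identifier.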
- Unicode 13 beta data as of 2019-nov-21
- uprops.icu format version 7.7 with more bits for Script/Script_Extensions
- more bits in spoof checker ScriptSet
- root line break rules adjusted for UAX 14 changes, from Andy
- line break tailorings not yet in sync with root
- added PKGDATA_TRAILING_SPACE to all of the pkgdataMakefile.in files.
- NOTE: Users who create their own pkgdata.inc / icupkg.inc files may need
to recreate this PKGDATA_TRAILING_SPACE behavior.
- used the above variable, normally undefined, in mh-* files that need a trailing space
- Also, fixed use of system() in pkgdata.cpp per ICU-20538
This was causing pkgdata to return a zero status even on clang
failure, masking this issue.
(cherry picked from commit 83a0542b5b)
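The message does not spell out how the failure was masked. One well-known way this can happen is sketched below in Python, as an assumption rather than a description of pkgdata.cpp. On typical POSIX systems, system() returns a raw wait status in which a child exit code of 1 appears as 256. If that raw value is passed on as the tool's own exit status, only the low 8 bits reach the parent process, and 256 becomes 0. The function names are hypothetical.

```python
def exit_code_seen_by_parent(value_passed_to_exit: int) -> int:
    """Only the low 8 bits of an exit status are reported to the parent on POSIX."""
    return value_passed_to_exit & 0xFF


def normalize_status(raw_status: int) -> int:
    """Collapse any nonzero raw status to 1 so that a failure stays a failure."""
    return 0 if raw_status == 0 else 1


# Typical raw wait status when the child command exited with code 1.
raw = 1 << 8

print(exit_code_seen_by_parent(raw))                    # failure is lost
print(exit_code_seen_by_parent(normalize_status(raw)))  # failure is kept
```

Normalizing the status before returning it is one simple remedy. Decoding the wait status into the real exit code is another.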
- see also ICU-20062
- add a `-B` option to the two python invocations on Windows
- set PYTHONDONTWRITEBYTECODE in configure.ac and icudefs.mk.in
Co-authored-by: Fredrik Roubert <roubert@google.com>
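Both settings in the last two items tell the Python interpreter not to write .pyc files. The check below is a minimal sketch and not part of the build. The helper name is hypothetical. It launches the current interpreter three ways and reads sys.dont_write_bytecode, which is the flag that both -B and PYTHONDONTWRITEBYTECODE control.

```python
import os
import subprocess
import sys

PROBE = "import sys; print(sys.dont_write_bytecode)"


def bytecode_disabled(extra_args=(), extra_env=None) -> bool:
    """Run a child interpreter and report whether bytecode writing is off."""
    env = dict(os.environ)
    # Start from a clean slate so the caller's environment does not skew the result.
    env.pop("PYTHONDONTWRITEBYTECODE", None)
    env.update(extra_env or {})
    completed = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip() == "True"


print(bytecode_disabled())
print(bytecode_disabled(extra_args=("-B",)))
print(bytecode_disabled(extra_env={"PYTHONDONTWRITEBYTECODE": "1"}))
```

The command-line option affects a single invocation. The environment variable is inherited by every Python process that the build starts.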