Commit Graph

135 Commits

Author SHA1 Message Date
Andy Heninger
14bcaaf58e ICU-20876 Regex Grapheme Cluster matching with Break Iterators.
Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.

The old implementation had fallen behind the Unicode UAX-29 specification for
grapheme clusters, and could not be easily updated.

The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.

Also note that this change adds a new dependency on Break Iteration. Regex
patterns that previously worked with ICU builds configured without break
iteration will now fail, but only if they include \X for matching
grapheme cluster boundaries.
2020-02-18 18:28:10 -08:00
Andy Heninger
d6b88d49e3 ICU-20939 Fix problem with regexp \b boundaries & UTF-8 text
In regular expressions, when testing for word boundaries with \b, the
boundaries were incorrect when in Unicode mode, meaning that an ICU word break
iterator is being used to find the boundaries, and the text being matched is
UTF-8 encoded.

The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
2020-02-03 16:51:17 -08:00
Andy Heninger
54a60fe6f4 ICU-11548 Improve regex static UnicodeSets handling
Compiled regular expression patterns make use of several shared common
UnicodeSets. This change simplifies the creation and use of these
static UnicodeSets.

- Pointer fields to the static sets are removed from the compiled patterns,
  and the static variables are accessed directly. The deleted pointers
  were a hold-over from earlier code that did not use shared statics.

- The UnicodeSet pattern literals are changed from hex constants to
  u"string literals".

- The size of fRuleSets (from regexst.h) is changed from a hard-coded 10
  to the number of UnicodeSets actually required. Doing this required
  a change to regexcst.pl to export the required size. Changing and
  rerunning this perl code resulted in massive but benign changes to
  the generated file regexcst.h, the result of perl having changed its
  order of enumeration of hashes since the file was last regenerated.

- UnicodeSets are frozen when possible. Should result in faster matching.
2020-01-30 15:13:07 -08:00
Andy Heninger
03937347fb ICU-20863 Regex, lazy creation and reduced size of map from capture group names to numbers.
2019-10-22 17:23:26 -07:00
Andy Heninger
327087150f ICU-20618 Regex nested lookaround expressions, clean up active match region handling.
2019-08-19 13:31:34 -07:00
Fredrik Roubert
5d6d29b76a ICU-20601 Remove superfluous semicolons (-Wextra-semi-stmt).
These are the same changes for the C++ code as were done for the C code
by commit 17606e0345.
2019-08-15 12:30:21 +02:00
Jeff Genovy
5c8960e59e ICU-20074 Revise UPRV_UNREACHABLE macro to always call abort().
Moved the macro from platform.h to uassert.h.
Removed any "unreachable" code that previously occurred after the UPRV_UNREACHABLE macro is used.
Changes based on review from Andy.

Co-authored-by: Daniel Ju <daju@microsoft.com>
2019-01-24 18:50:04 -08:00
Daniel Ju
7453181fff ICU-20074 Define UPRV_UNREACHABLE macro for unreachable code
Replaced occurrences of U_ASSERT(FALSE) with new UPRV_UNREACHABLE macro.
2019-01-14 14:16:26 -08:00
Daniel Ju
b13c951348 ICU-20043 ICU-13214 ICU-13764 MSVC W3 and W4 warning cleanup (#53)
Cleaned up all of the MSVC W3 warnings and most of the W4 warnings in the common and i18n projects.
2018-09-27 14:27:38 -07:00
Andy Heninger
193aa17f08 ICU-13631 Regex Address Sanitizer fix.
X-SVN-Rev: 41086
2018-03-08 18:32:15 +00:00
George Rhoten
a82e70e5b1 ICU-13029 Mark unmodified static data as const
X-SVN-Rev: 40065
2017-04-19 18:55:46 +00:00
Andy Heninger
5f57938910 ICU-12884 regex timeout not working with {loop counts} in patterns.
X-SVN-Rev: 39693
2017-02-21 23:12:48 +00:00
Andy Heninger
242e02c388 ICU-12764 icu4c utf-8 source files, update Copyright notices.
X-SVN-Rev: 39583
2017-01-20 00:20:31 +00:00
Michael Ow
61607c2773 ICU-12564 Update copyright notice in trunk
X-SVN-Rev: 38848
2016-06-15 18:58:17 +00:00
Yoshito Umaoka
00ca13e126 ICU-12564 Reverted r38761 and r38762, because we want to prepend the Unicode copyright for existing source files, instead of replacing copyright comments.
X-SVN-Rev: 38776
2016-05-31 21:45:07 +00:00
Michael Ow
c9f199a30f ICU-12564 Update copyright notice in ICU4C
X-SVN-Rev: 38761
2016-05-26 22:32:17 +00:00
Fredrik Roubert
7f4b8d106b ICU-12012 Replace all sizeof p / sizeof *p with UPRV_LENGTHOF().
R=markus.icu@gmail.com

Review URL: https://codereview.appspot.com/285520043 .

X-SVN-Rev: 38337
2016-02-23 10:40:09 +00:00
Andy Heninger
8dba7301b7 ICU-11554 Fix regex bug with look-behind matching & UTF-8 input.
X-SVN-Rev: 38056
2015-10-09 20:01:46 +00:00
Andy Heninger
9b8403900a ICU-11507 Regex, fix problem with backrefs of unpaired surrogates.
X-SVN-Rev: 38038
2015-10-06 23:59:28 +00:00
Markus Scherer
0f0c7656d3 ICU-11832 some better error checking based on static code analysis
X-SVN-Rev: 37930
2015-09-10 01:42:34 +00:00
Andy Heninger
ce09d8a4bc ICU-11393 Regex, add pattern chars R v and h
X-SVN-Rev: 37057
2015-02-24 00:24:59 +00:00
Andy Heninger
ec3f77f878 ICU-5312 Regular Expressions Named Capture.
X-SVN-Rev: 37040
2015-02-18 23:56:19 +00:00
Andy Heninger
22c8c94d14 ICU-11469 Regular Expressions, remove old tech preview functions.
X-SVN-Rev: 36953
2015-01-14 00:03:29 +00:00
Yoshito Umaoka
ff4fd29a21 ICU-11302 Proper error handling with utext_clone in regular expression implementation.
X-SVN-Rev: 36673
2014-10-13 17:46:34 +00:00
Andy Heninger
3dd19f8a5b ICU-10815 Regex error handling, minor cleanups.
X-SVN-Rev: 36306
2014-09-02 21:39:18 +00:00
Andy Heninger
f2dfa7422e ICU-10815 Fix for uregex_findNext() not setting U_REGEX_STOPPED_BY_CALLER
X-SVN-Rev: 36260
2014-08-28 01:19:29 +00:00
Andy Heninger
1ba1ec3b83 ICU-11049 regular expressions, use same logic in UText and (UChar *) code paths when checking limit of potential match start positions.
X-SVN-Rev: 36161
2014-08-14 17:44:05 +00:00
Andy Heninger
e03585d7cf ICU-11049 fix regex find() memory overrun.
X-SVN-Rev: 36124
2014-08-06 21:49:08 +00:00
Andy Heninger
10dd7ed47b ICU-10463 Regular Expressions, rework debug conditionals to fix build failures on clang, and to somewhat simplify.
X-SVN-Rev: 34565
2013-10-14 22:11:21 +00:00
Andy Heninger
045919648e ICU-10459 Fix segfault in uregex_group() when match is in invalid state.
X-SVN-Rev: 34559
2013-10-11 20:59:39 +00:00
Andy Heninger
20016a58db ICU-9719 Regular Expressions, add loop breaking to unbounded {min, max} loops.
X-SVN-Rev: 33848
2013-06-26 00:27:11 +00:00
Andy Heninger
b129b40280 ICU-10024 fixes for Regexp, look-behind pattern fails to compile bug.
X-SVN-Rev: 33835
2013-06-18 20:38:08 +00:00
Michael Ow
0ca13b73b0 ICU-9292 Merge BEAM warning fixes from branch into trunk
X-SVN-Rev: 31792
2012-05-03 05:50:26 +00:00
Andy Heninger
b8315ecf6a ICU-8826 Regex case insensitive match fixes; also fixes #6074, hitEnd() sometimes fails.
X-SVN-Rev: 31233
2012-01-20 00:50:02 +00:00
Andy Heninger
81618f4a97 ICU-8824 Apply patch to remove read of uninitialized memory.
X-SVN-Rev: 30688
2011-09-19 20:48:29 +00:00
Markus Scherer
e31ce99b84 ICU-8575 option for not including utf headers by default; replace uses of deprecated utf_old.h macros
X-SVN-Rev: 30430
2011-07-27 05:53:56 +00:00
Steven R. Loomis
b209c07620 ICU-8687 warnings fix (clang)
X-SVN-Rev: 30285
2011-07-06 20:05:38 +00:00
Andy Heninger
8148726df2 ICU-8404 Regular Expressions split(), fix incorrect handling of trailing empty field
X-SVN-Rev: 29801
2011-04-15 00:48:39 +00:00
Andy Heninger
4a932a4dfc ICU-7795 Regular Expression refreshInputText added.
X-SVN-Rev: 29363
2011-01-27 01:12:12 +00:00
Michael Grady
cfe1b1d92f ICU-8104 Move REGEXFINDPROGRESS_INTERRUPT macro out of regex.h into implementation
X-SVN-Rev: 28961
2010-10-30 21:49:29 +00:00
Andy Heninger
69a7b1ea42 ICU-6545 Allow regex reset to the end of a string
X-SVN-Rev: 28801
2010-10-12 22:52:51 +00:00
Markus Scherer
b5e1330176 ICU-7264 merge Unicode 6.0 into trunk from branches/markus/uni60 -r 28339:28657
X-SVN-Rev: 28661
2010-09-21 00:12:49 +00:00
Michael Grady
751473d25a ICU-7813 64bit regex API, 7675: UText-based Regex to use native indexes, 7764: Improved UText-regex API error handling, 7855: UText regex group API returns shallow clone, 7851: Set region and start position, 7763: Inline regex progress callback function.
X-SVN-Rev: 28647
2010-09-18 03:07:17 +00:00
Andy Heninger
74e297291f ICU-7780 fix various compiler warnings
X-SVN-Rev: 28301
2010-07-12 18:03:29 +00:00
Andy Heninger
e8985e068a ICU-7740 Regex error handling cleanups
X-SVN-Rev: 28210
2010-06-16 17:54:04 +00:00
Michael Grady
8b6ce73317 ICU-7666 Regex find progress callback (from branch to trunk)
X-SVN-Rev: 28201
2010-06-15 20:19:10 +00:00
Michael Ow
7058d05758 ICU-7651 Fix incorrect size calculation of fSmallData in regex.
X-SVN-Rev: 27992
2010-04-28 02:32:00 +00:00
Peter Edberg
829d15c795 ICU-7538 Avoid negative array index when UTEXT_NEXT32 returns -1 (U_SENTINEL)
X-SVN-Rev: 27821
2010-03-11 19:54:36 +00:00
Peter Edberg
411a93712a ICU-4521 Merge from branch, fixes for 64bit alignment & type conversions
X-SVN-Rev: 27670
2010-02-25 06:33:29 +00:00
Michael Ow
0763686c6c ICU-7370 Log data errors to ensure that intltest and cintltst pass without data.
X-SVN-Rev: 27649
2010-02-24 16:17:03 +00:00