Andy Heninger
14bcaaf58e
ICU-20876 Regex Grapheme Cluster matching with Break Iterators.
...
Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.
The old implementation had fallen behind the Unicode UAX-29 specification for
grapheme clusters, and could not be easily updated.
The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
Also note that this change adds a new dependency on Break Iteration. Regex
patterns that previously worked with ICU builds configured with no break
iteration will now fail, but only if they include \X for matching
grapheme clusters.
2020-02-18 18:28:10 -08:00
Andy Heninger
d6b88d49e3
ICU-20939 Fix problem with regexp \b boundaries & UTF-8 text
...
In regular expressions, word boundaries found with \b were incorrect when
matching in Unicode mode (where an ICU word break iterator is used to find the
boundaries) and the text being matched was UTF-8 encoded.
The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
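
The index mismatch described above can be illustrated without ICU. The sketch below is not taken from the ICU sources; `utf16IndexForUtf8Offset` is a hypothetical helper written for this note, and it assumes well-formed UTF-8 and an offset that falls on a code point boundary. It shows that a position expressed as a UTF-8 byte offset generally differs from the same position expressed in UTF-16 code units, so converting an index that was already in the correct native form yields the wrong position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): given well-formed UTF-8 and a byte
// offset on a code point boundary, return how many UTF-16 code units are
// needed to encode the text that precedes that offset.
std::size_t utf16IndexForUtf8Offset(const std::string &utf8, std::size_t byteOffset) {
    assert(byteOffset <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80) {
            continue;  // continuation byte: counted together with its lead byte
        }
        // A 4-byte sequence encodes a supplementary code point, which takes
        // a surrogate pair (two code units) in UTF-16; everything else takes one.
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For the text "café bar" encoded as UTF-8, the boundary after "café" is at byte offset 5, while the same boundary is at index 4 in UTF-16. The two values agree only for pure ASCII text, which may be why a mismatch of this kind can go unnoticed in tests.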
2020-02-03 16:51:17 -08:00
Andy Heninger
54a60fe6f4
ICU-11548 Improve regex static UnicodeSets handling
...
Compiled regular expression patterns make use of several shared common
UnicodeSets. This change simplifies the creation and use of these
static UnicodeSets.
- Pointer fields to the static sets are removed from the compiled patterns,
and the static variables are accessed directly. The deleted pointers
were a hold-over from earlier code that did not use shared statics.
- The UnicodeSet pattern literals are changed from hex constants to
u"string literals".
- The size of fRuleSets (from regexst.h) is changed from a hard-coded 10
to the number of UnicodeSets actually required. Doing this required
a change to regexcst.pl to export the required size. Changing and
rerunning this perl code resulted in massive but benign changes to
the generated file regexcst.h, the result of perl having changed its
order of enumeration of hashes since the file was last regenerated.
- UnicodeSets are frozen when possible. This should result in faster matching.
2020-01-30 15:13:07 -08:00
Andy Heninger
03937347fb
ICU-20863 Regex, lazy creation and reduced size of map from capture group names to numbers.
2019-10-22 17:23:26 -07:00
Andy Heninger
327087150f
ICU-20618 Regex nested lookaround expressions, clean up active match region handling.
2019-08-19 13:31:34 -07:00
Fredrik Roubert
5d6d29b76a
ICU-20601 Remove superfluous semicolons (-Wextra-semi-stmt).
...
These are the same changes for the C++ code as were done for the C code
by commit 17606e0345.
2019-08-15 12:30:21 +02:00
Jeff Genovy
5c8960e59e
ICU-20074 Revise UPRV_UNREACHABLE macro to always call abort().
...
Moved the macro from platform.h to uassert.h.
Removed any "unreachable" code that previously occurred after the UPRV_UNREACHABLE macro is used.
Changes based on review from Andy.
Co-authored-by: Daniel Ju <daju@microsoft.com>
2019-01-24 18:50:04 -08:00
Daniel Ju
7453181fff
ICU-20074 Define UPRV_UNREACHABLE macro for unreachable code
...
Replaced occurrences of U_ASSERT(FALSE) with new UPRV_UNREACHABLE macro.
2019-01-14 14:16:26 -08:00
Daniel Ju
b13c951348
ICU-20043 ICU-13214 ICU-13764 MSVC W3 and W4 warning cleanup (#53)
...
Cleaned up all of the MSVC W3 warnings and most of the W4 warnings in the common and i18n projects.
2018-09-27 14:27:38 -07:00
Andy Heninger
193aa17f08
ICU-13631 Regex Address Sanitizer fix.
...
X-SVN-Rev: 41086
2018-03-08 18:32:15 +00:00
George Rhoten
a82e70e5b1
ICU-13029 Mark unmodified static data as const
...
X-SVN-Rev: 40065
2017-04-19 18:55:46 +00:00
Andy Heninger
5f57938910
ICU-12884 regex timeout not working with {loop counts} in patterns.
...
X-SVN-Rev: 39693
2017-02-21 23:12:48 +00:00
Andy Heninger
242e02c388
ICU-12764 icu4c utf-8 source files, update Copyright notices.
...
X-SVN-Rev: 39583
2017-01-20 00:20:31 +00:00
Michael Ow
61607c2773
ICU-12564 Update copyright notice in trunk
...
X-SVN-Rev: 38848
2016-06-15 18:58:17 +00:00
Yoshito Umaoka
00ca13e126
ICU-12564 Reverted r38761 and r38762, because we want to prepend the Unicode copyright for existing source files, instead of replacing copyright comments.
...
X-SVN-Rev: 38776
2016-05-31 21:45:07 +00:00
Michael Ow
c9f199a30f
ICU-12564 Update copyright notice in ICU4C
...
X-SVN-Rev: 38761
2016-05-26 22:32:17 +00:00
Fredrik Roubert
7f4b8d106b
ICU-12012 Replace all sizeof p / sizeof *p with UPRV_LENGTHOF().
...
R=markus.icu@gmail.com
Review URL: https://codereview.appspot.com/285520043 .
X-SVN-Rev: 38337
2016-02-23 10:40:09 +00:00
Andy Heninger
8dba7301b7
ICU-11554 Fix regex bug with look-behind matching & UTF-8 input.
...
X-SVN-Rev: 38056
2015-10-09 20:01:46 +00:00
Andy Heninger
9b8403900a
ICU-11507 Regex, fix problem with backrefs of unpaired surrogates.
...
X-SVN-Rev: 38038
2015-10-06 23:59:28 +00:00
Markus Scherer
0f0c7656d3
ICU-11832 some better error checking based on static code analysis
...
X-SVN-Rev: 37930
2015-09-10 01:42:34 +00:00
Andy Heninger
ce09d8a4bc
ICU-11393 Regex, add pattern chars R v and h
...
X-SVN-Rev: 37057
2015-02-24 00:24:59 +00:00
Andy Heninger
ec3f77f878
ICU-5312 Regular Expressions Named Capture.
...
X-SVN-Rev: 37040
2015-02-18 23:56:19 +00:00
Andy Heninger
22c8c94d14
ICU-11469 Regular Expressions, remove old tech preview functions.
...
X-SVN-Rev: 36953
2015-01-14 00:03:29 +00:00
Yoshito Umaoka
ff4fd29a21
ICU-11302 Proper error handling with utext_clone in regular expression implementation.
...
X-SVN-Rev: 36673
2014-10-13 17:46:34 +00:00
Andy Heninger
3dd19f8a5b
ICU-10815 Regex error handling, minor cleanups.
...
X-SVN-Rev: 36306
2014-09-02 21:39:18 +00:00
Andy Heninger
f2dfa7422e
ICU-10815 Fix for uregex_findNext() not setting U_REGEX_STOPPED_BY_CALLER
...
X-SVN-Rev: 36260
2014-08-28 01:19:29 +00:00
Andy Heninger
1ba1ec3b83
ICU-11049 regular expressions, use same logic in UText and (UChar *) code paths when checking limit of potential match start positions.
...
X-SVN-Rev: 36161
2014-08-14 17:44:05 +00:00
Andy Heninger
e03585d7cf
ICU-11049 fix regex find() memory overrun.
...
X-SVN-Rev: 36124
2014-08-06 21:49:08 +00:00
Andy Heninger
10dd7ed47b
ICU-10463 Regular Expressions, rework debug conditionals to fix build failures on clang, and to somewhat simplify.
...
X-SVN-Rev: 34565
2013-10-14 22:11:21 +00:00
Andy Heninger
045919648e
ICU-10459 Fix segfault in uregex_group() when match is in invalid state.
...
X-SVN-Rev: 34559
2013-10-11 20:59:39 +00:00
Andy Heninger
20016a58db
ICU-9719 Regular Expressions, add loop breaking to unbounded {min, max} loops.
...
X-SVN-Rev: 33848
2013-06-26 00:27:11 +00:00
Andy Heninger
b129b40280
ICU-10024 fixes for Regexp, look-behind pattern fails to compile bug.
...
X-SVN-Rev: 33835
2013-06-18 20:38:08 +00:00
Michael Ow
0ca13b73b0
ICU-9292 Merge BEAM warning fixes from branch into trunk
...
X-SVN-Rev: 31792
2012-05-03 05:50:26 +00:00
Andy Heninger
b8315ecf6a
ICU-8826 Regex case insensitive match fixes; also fixes #6074, hitEnd() sometimes fails.
...
X-SVN-Rev: 31233
2012-01-20 00:50:02 +00:00
Andy Heninger
81618f4a97
ICU-8824 Apply patch to remove read of uninitialized memory.
...
X-SVN-Rev: 30688
2011-09-19 20:48:29 +00:00
Markus Scherer
e31ce99b84
ICU-8575 option for not including utf headers by default; replace uses of deprecated utf_old.h macros
...
X-SVN-Rev: 30430
2011-07-27 05:53:56 +00:00
Steven R. Loomis
b209c07620
ICU-8687 warnings fix (clang)
...
X-SVN-Rev: 30285
2011-07-06 20:05:38 +00:00
Andy Heninger
8148726df2
ICU-8404 Regular Expressions split(), fix incorrect handling of trailing empty field
...
X-SVN-Rev: 29801
2011-04-15 00:48:39 +00:00
Andy Heninger
4a932a4dfc
ICU-7795 Regular Expression refreshInputText added.
...
X-SVN-Rev: 29363
2011-01-27 01:12:12 +00:00
Michael Grady
cfe1b1d92f
ICU-8104 Move REGEXFINDPROGRESS_INTERRUPT macro out of regex.h into implementation
...
X-SVN-Rev: 28961
2010-10-30 21:49:29 +00:00
Andy Heninger
69a7b1ea42
ICU-6545 Allow regex reset to the end of a string
...
X-SVN-Rev: 28801
2010-10-12 22:52:51 +00:00
Markus Scherer
b5e1330176
ICU-7264 merge Unicode 6.0 into trunk from branches/markus/uni60 -r 28339:28657
...
X-SVN-Rev: 28661
2010-09-21 00:12:49 +00:00
Michael Grady
751473d25a
ICU-7813 64bit regex API, 7675: UText-based Regex to use native indexes, 7764: Improved UText-regex API error handling, 7855: UText regex group API returns shallow clone, 7851: Set region and start position, 7763: Inline regex progress callback function.
...
X-SVN-Rev: 28647
2010-09-18 03:07:17 +00:00
Andy Heninger
74e297291f
ICU-7780 fix various compiler warnings
...
X-SVN-Rev: 28301
2010-07-12 18:03:29 +00:00
Andy Heninger
e8985e068a
ICU-7740 Regex error handling cleanups
...
X-SVN-Rev: 28210
2010-06-16 17:54:04 +00:00
Michael Grady
8b6ce73317
ICU-7666 Regex find progress callback (from branch to trunk)
...
X-SVN-Rev: 28201
2010-06-15 20:19:10 +00:00
Michael Ow
7058d05758
ICU-7651 Fix incorrect size calculation of fSmallData in regex.
...
X-SVN-Rev: 27992
2010-04-28 02:32:00 +00:00
Peter Edberg
829d15c795
ICU-7538 Avoid negative array index when UTEXT_NEXT32 returns -1 (U_SENTINEL)
...
X-SVN-Rev: 27821
2010-03-11 19:54:36 +00:00
Peter Edberg
411a93712a
ICU-4521 Merge from branch, fixes for 64bit alignment & type conversions
...
X-SVN-Rev: 27670
2010-02-25 06:33:29 +00:00
Michael Ow
0763686c6c
ICU-7370 Log data errors to ensure that intltest and cintltst pass without data.
...
X-SVN-Rev: 27649
2010-02-24 16:17:03 +00:00